New AVX2 implementation of FIR_1x4() achieves 24 FLOPS per clock on HSW/BDW/SKL
Removed old AVX implementation
32-byte alignment of SIMD buffers and tables
Fixed the poorly optimized code produced by GCC -O2