New AVX2 implementation of FIR_1x4() achieves 24 FLOPS per clock on HSW/BDW/SKL.
Removed old AVX implementation.
32-byte alignment of SIMD buffers and tables.
Fixed the poorly optimized code produced by GCC -O2.
int16_t and float versions.
Removed unused SSE2 version.
Removed unused getExactInput().
Refactored to remove nested #ifdefs.
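
For illustration only, here is a minimal sketch of what 32-byte alignment of coefficient tables could look like in C++. The names FirTables, kTaps, isAligned32 and assertAligned32 are hypothetical and the tap count is an assumption; none of them are taken from this change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical constants: 32 bytes is the width of one 256-bit AVX2 register;
// the tap count is an arbitrary assumption for this sketch.
constexpr std::size_t kAlign = 32;
constexpr std::size_t kTaps = 64;

// alignas(32) asks the compiler to place each table on a 32-byte boundary,
// so a kernel could use aligned 256-bit loads on them.
struct FirTables {
    alignas(kAlign) float   coefsFloat[kTaps];
    alignas(kAlign) int16_t coefsInt16[kTaps];
};

// The struct inherits the strictest alignment of its members.
static_assert(alignof(FirTables) == kAlign, "tables must be 32-byte aligned");

// Returns true when p sits on a 32-byte boundary.
inline bool isAligned32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlign == 0;
}

// Debug-time check a kernel could run before relying on aligned loads.
inline void assertAligned32(const void* p) {
    assert(isAligned32(p));
}
```

Objects with static or automatic storage honor alignas directly. Heap allocation of over-aligned types with plain new is only guaranteed to respect the alignment from C++17 onward.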