IDWT 5x3 single-pass lifting and SSE2/AVX2 implementation #957
rouault merged 6 commits into uclouvain:master
* Use a single-pass lifting inverse wavelet transform.
* For the vertical pass, use SSE2 when available so as to process 8 columns in parallel. This is the most beneficial improvement, since the vertical pass involves a lot of cache thrashing.

With the bench_dwt utility with default arguments (16383x16383 image), time goes from 4.064 s to 1.212 s.
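To make the "single-pass lifting" idea concrete, here is a minimal 1D sketch in plain C. It is not the code from this PR: the function names are hypothetical, and it assumes an even-length signal whose first sample is low-pass, with mirrored boundaries and the usual reversible 5/3 lifting steps. The first function sweeps the output twice (all even samples, then all odd samples); the second produces the same result in one sweep by carrying the previous even sample in a local variable.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only (not taken from the PR).
 * s: `half` low-pass coefficients, d: `half` high-pass coefficients,
 * x: 2*half reconstructed samples.
 * Assumes >> on negative values is an arithmetic shift, as on
 * mainstream compilers. */

/* Reference version: two sweeps over the output buffer. */
void idwt53_two_pass(const int32_t *s, const int32_t *d,
                     int32_t *x, size_t half)
{
    size_t n;
    if (half == 0) return;
    for (n = 0; n < half; n++) {            /* undo the update step (even samples) */
        int32_t dm = d[n > 0 ? n - 1 : 0];  /* mirror at the left edge */
        x[2 * n] = s[n] - ((dm + d[n] + 2) >> 2);
    }
    for (n = 0; n < half; n++) {            /* undo the predict step (odd samples) */
        int32_t xr = (n + 1 < half) ? x[2 * n + 2] : x[2 * n]; /* mirror at the right edge */
        x[2 * n + 1] = d[n] + ((x[2 * n] + xr) >> 1);
    }
}

/* Single sweep: each output pair is written once, and the even sample
 * needed by the next pair is kept in a local variable. */
void idwt53_single_pass(const int32_t *s, const int32_t *d,
                        int32_t *x, size_t half)
{
    size_t n;
    int32_t even, next_even;
    if (half == 0) return;
    even = s[0] - ((d[0] + d[0] + 2) >> 2);  /* left edge uses d[0] twice */
    for (n = 0; n + 1 < half; n++) {
        next_even = s[n + 1] - ((d[n] + d[n + 1] + 2) >> 2);
        x[2 * n] = even;
        x[2 * n + 1] = d[n] + ((even + next_even) >> 1);
        even = next_even;
    }
    x[2 * half - 2] = even;
    x[2 * half - 1] = d[half - 1] + even;    /* (even + even) >> 1 == even */
}
```

Both functions give identical output for the same inputs; the single-sweep form simply touches each output location once instead of twice.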
Thanks to our macros that abstract SSE use, the functions can use AVX2 when it is available at compile time. This brings an extra 23% speed improvement on bench_dwt in 64-bit builds with AVX2 compared to SSE2.
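As a rough illustration of how such an abstraction can work, the sketch below selects AVX2, SSE2, or a scalar fallback at compile time behind one set of macros. The macro names (`VEC`, `VEC_LANES`, `VEC_ADD`, ...) and the function `undo_update_row` are assumptions made for this example and are not copied from the PR; only the intrinsics themselves are real. The function applies one lifting step, `out = s - ((d_prev + d_cur + 2) >> 2)`, across a row of adjacent columns.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative macro names (assumptions, not the PR's actual macros). */
#if defined(__AVX2__)
#include <immintrin.h>
#define VEC              __m256i
#define VEC_LANES        8
#define VEC_SET1(x)      _mm256_set1_epi32(x)
#define VEC_LOADU(p)     _mm256_loadu_si256((const __m256i *)(p))
#define VEC_STOREU(p, v) _mm256_storeu_si256((__m256i *)(p), (v))
#define VEC_ADD(a, b)    _mm256_add_epi32((a), (b))
#define VEC_SUB(a, b)    _mm256_sub_epi32((a), (b))
#define VEC_SAR(a, n)    _mm256_srai_epi32((a), (n))
#elif defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define VEC              __m128i
#define VEC_LANES        4
#define VEC_SET1(x)      _mm_set1_epi32(x)
#define VEC_LOADU(p)     _mm_loadu_si128((const __m128i *)(p))
#define VEC_STOREU(p, v) _mm_storeu_si128((__m128i *)(p), (v))
#define VEC_ADD(a, b)    _mm_add_epi32((a), (b))
#define VEC_SUB(a, b)    _mm_sub_epi32((a), (b))
#define VEC_SAR(a, n)    _mm_srai_epi32((a), (n))
#else /* scalar fallback so the sketch builds on any target */
#define VEC              int32_t
#define VEC_LANES        1
#define VEC_SET1(x)      (x)
#define VEC_LOADU(p)     (*(p))
#define VEC_STOREU(p, v) (*(p) = (v))
#define VEC_ADD(a, b)    ((a) + (b))
#define VEC_SUB(a, b)    ((a) - (b))
#define VEC_SAR(a, n)    ((a) >> (n))
#endif

/* One lifting step on a row of `ncols` adjacent columns:
 * out[c] = s[c] - ((d_prev[c] + d_cur[c] + 2) >> 2).
 * The body is written once; the macros decide the register width. */
void undo_update_row(const int32_t *s, const int32_t *d_prev,
                     const int32_t *d_cur, int32_t *out, size_t ncols)
{
    const VEC two = VEC_SET1(2);
    size_t c = 0;
    for (; c + VEC_LANES <= ncols; c += VEC_LANES) {
        VEC sum = VEC_ADD(VEC_ADD(VEC_LOADU(d_prev + c), VEC_LOADU(d_cur + c)), two);
        VEC_STOREU(out + c, VEC_SUB(VEC_LOADU(s + c), VEC_SAR(sum, 2)));
    }
    for (; c < ncols; c++) {  /* leftover columns, handled one at a time */
        out[c] = s[c] - ((d_prev[c] + d_cur[c] + 2) >> 2);
    }
}
```

Because the choice is made by the preprocessor, the same source yields an SSE2 build by default on x86_64 and an AVX2 build when the compiler is told to target AVX2 (for example with `-mavx2` on GCC or Clang).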
…able tests since Travis doesn't have AVX2 compatible machines)
Note: the failure in AppVeyor is a network flake. It passes on the same commit pushed to my account: https://ci.appveyor.com/project/rouault/openjpeg/build/2.1.1.15
Results for opj_decompress time on 8c05f00a-ae05-4dd5-bdc7-a1b5eed4ebfb.jp2 from testovani (15595 wide x 11128 tall x 3 components):
* idwt_53_improvements branch, SSE2: 48.698 s

So a global decrease of 12.6% (7.061 s) from master to the idwt_53_improvements branch with SSE2, and an extra 1.3% decrease from SSE2 to AVX2 within the idwt_53_improvements branch.
Implements #953
With the new bench_dwt utility, on x86_64:
SSE2/AVX2 is used in the vertical pass to handle several columns at the same time. This avoids a lot of CPU cache thrashing.
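The memory-access pattern behind this can be sketched in plain C, without intrinsics. This is an illustration under assumptions rather than the PR's code: the name `idwt53_v_batch`, the `BATCH` constant, and the tile layout (low-pass rows in the top half, high-pass rows in the bottom half, even height) are all hypothetical, and the sketch uses a simple two-step form for readability. The point is that every inner loop walks adjacent columns of one row, which are contiguous in memory and map naturally onto SIMD lanes, instead of striding down a single column.

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH 8  /* columns handled together (hypothetical constant) */

/* Inverse 5/3 vertical pass on `ncols` (<= BATCH) adjacent columns.
 * tile:   points at the first column of the batch in row 0
 * stride: distance in elements between consecutive rows
 * half:   number of low-pass rows; rows [0, half) are low-pass and
 *         rows [half, 2*half) are high-pass (layout assumption)
 * tmp:    scratch buffer holding at least 2*half*BATCH values
 * Assumes >> on negative values is an arithmetic shift. */
void idwt53_v_batch(int32_t *tile, size_t stride, size_t half,
                    size_t ncols, int32_t *tmp)
{
    const int32_t *lo = tile;
    const int32_t *hi = tile + half * stride;
    size_t n, r, c;

    for (n = 0; n < half; n++) {  /* even output rows */
        const int32_t *s  = lo + n * stride;
        const int32_t *d  = hi + n * stride;
        const int32_t *dm = hi + (n > 0 ? n - 1 : 0) * stride; /* mirror at the top */
        for (c = 0; c < ncols; c++) {  /* contiguous reads across columns */
            tmp[(2 * n) * BATCH + c] = s[c] - ((dm[c] + d[c] + 2) >> 2);
        }
    }
    for (n = 0; n < half; n++) {  /* odd output rows */
        const int32_t *d  = hi + n * stride;
        const int32_t *e0 = tmp + (2 * n) * BATCH;
        const int32_t *e1 = (n + 1 < half) ? tmp + (2 * n + 2) * BATCH : e0; /* mirror at the bottom */
        for (c = 0; c < ncols; c++) {
            tmp[(2 * n + 1) * BATCH + c] = d[c] + ((e0[c] + e1[c]) >> 1);
        }
    }
    for (r = 0; r < 2 * half; r++) {  /* write the interleaved rows back */
        for (c = 0; c < ncols; c++) {
            tile[r * stride + c] = tmp[r * BATCH + c];
        }
    }
}
```

A caller would walk the tile in steps of `BATCH` columns and invoke this once per step, so each cache line fetched from a row serves several columns rather than one.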
Note: I tried an SSE2-optimized version of opj_idwt53_h_cas0(), but the gain is almost unnoticeable, so it is not included in this PR.