IDWT 5x3 single-pass lifting and SSE2/AVX2 implementation#957

Merged
rouault merged 6 commits into uclouvain:master from rouault:idwt_53_improvements
Jun 26, 2017

rouault commented Jun 21, 2017

Implements #953

With the new bench_dwt utility, on x86_64:

  • before changes: 3.356 s
  • with SSE2 optimization (the default on x86_64): 0.992 s
  • with AVX2 optimization (requested at compilation time): 0.744 s

SSE2/AVX2 is used in the vertical pass to handle several columns at the same time. This avoids a lot of CPU cache thrashing.
Note: I tried an SSE2-optimized version of opj_idwt53_h_cas0(), but the gain was almost unnoticeable, so it is not included in this PR.
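
As a rough illustration of why handling several columns per vertical sweep helps, here is a minimal sketch in portable C (not the code from this PR). It applies only the even-row lifting step of the inverse 5x3 transform. The function name, the `BATCH` width, and the separate row-major `low`/`high`/`even` buffers are assumptions made for the example. The inner loop over a batch of adjacent columns is the part that an SSE2 or AVX2 register operation would replace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH 8 /* columns handled per vertical sweep (assumed width) */

/* Floor division by 4 that is also correct for negative values. */
int32_t floor_div4(int32_t v)
{
    return v >= 0 ? v / 4 : -((3 - v) / 4);
}

/* Vertical even-row lifting step on a row-major tile:
 *   even[n][c] = low[n][c] - floor((high[n-1][c] + high[n][c] + 2) / 4)
 * with high[-1] mirrored to high[0]. Columns are visited in batches, so
 * each row access reads up to BATCH contiguous values. */
void idwt53_v_update_batched(const int32_t *low, const int32_t *high,
                             int32_t *even, size_t half_rows, size_t width)
{
    assert(low && high && even);
    for (size_t c0 = 0; c0 < width; c0 += BATCH) {
        size_t n_cols = (width - c0 < BATCH) ? width - c0 : BATCH;
        for (size_t n = 0; n < half_rows; n++) {
            const int32_t *d_prev = high + (n == 0 ? 0 : n - 1) * width + c0;
            const int32_t *d_cur = high + n * width + c0;
            const int32_t *s = low + n * width + c0;
            int32_t *x = even + n * width + c0;
            /* This loop is what one or two SIMD registers would cover. */
            for (size_t c = 0; c < n_cols; c++) {
                x[c] = s[c] - floor_div4(d_prev[c] + d_cur[c] + 2);
            }
        }
    }
}
```

Walking a single column at a time would typically touch a different cache line on every row of a wide tile; with a batch, each row access uses several adjacent values before moving down.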

rouault added 6 commits June 20, 2017 17:56
* Use single-pass lifting inverse wavelet transform.
* For vertical pass, use SSE2 when available so as to process 8 columns
  in parallel. This is the most beneficial improvement, since the
  vertical pass involves a lot of cache thrashing.

With the bench_dwt utility with default arguments (16383x16383 image),
time goes from 4.064 s to 1.212 s.
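
To make "single-pass lifting" concrete, the following is a minimal scalar sketch for one 1D line of even length that starts on an even sample. It is not the implementation from this PR: the function names and the split `low`/`high` input arrays are assumptions for the example. The two-pass version computes all even samples and then all odd samples, while the single-pass version produces each odd sample as soon as its two even neighbours are known, so the line is traversed once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Floor division by 2^k that is also correct for negative values. */
int32_t floor_shift(int32_t v, int k)
{
    return v >= 0 ? v >> k : -((-v + (1 << k) - 1) >> k);
}

/* Reference: classic two-pass inverse 5x3 lifting.
 * low/high hold `half` coefficients each, out receives 2 * half samples. */
void idwt53_two_pass(const int32_t *low, const int32_t *high,
                     int32_t *out, size_t half)
{
    assert(half > 0);
    for (size_t n = 0; n < half; n++) { /* pass 1: even samples */
        int32_t d_prev = high[n == 0 ? 0 : n - 1]; /* mirror at the start */
        out[2 * n] = low[n] - floor_shift(d_prev + high[n] + 2, 2);
    }
    for (size_t n = 0; n < half; n++) { /* pass 2: odd samples */
        /* mirror at the end: the last odd sample sees the same even twice */
        int32_t x_next = out[n + 1 < half ? 2 * n + 2 : 2 * n];
        out[2 * n + 1] = high[n] + floor_shift(out[2 * n] + x_next, 1);
    }
}

/* Single pass: each iteration computes the next even sample and then
 * immediately the odd sample that sits between the two known evens. */
void idwt53_single_pass(const int32_t *low, const int32_t *high,
                        int32_t *out, size_t half)
{
    assert(half > 0);
    int32_t prev_even = low[0] - floor_shift(high[0] + high[0] + 2, 2);
    for (size_t n = 0; n + 1 < half; n++) {
        int32_t next_even =
            low[n + 1] - floor_shift(high[n] + high[n + 1] + 2, 2);
        out[2 * n] = prev_even;
        out[2 * n + 1] = high[n] + floor_shift(prev_even + next_even, 1);
        prev_even = next_even;
    }
    out[2 * half - 2] = prev_even;
    out[2 * half - 1] = high[half - 1] + prev_even; /* mirrored neighbour */
}
```

For `low = {10, 20, 30}` and `high = {1, -2, 3}`, both functions reconstruct `{9, 15, 20, 23, 30, 33}`.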
Thanks to our macros that abstract SSE use, the functions can use
AVX2 when available (at compile time).

This brings an extra 23% speed improvement on bench_dwt in 64-bit builds
with AVX2 compared to SSE2.
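
The idea of macros that hide the SIMD width can be sketched as follows. This is only an illustration: the `VEC_*` macro names and the `idwt53_predict_row` function are hypothetical and are not the macros used in this PR. The same kernel source compiles to 8-lane AVX2 code when `__AVX2__` is defined, to 4-lane SSE2 code when only `__SSE2__` is defined, and to plain scalar code otherwise. The kernel is the odd-row lifting step applied across the columns of one row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical abstraction layer: pick the register type at compile time. */
#if defined(__AVX2__)
#include <immintrin.h>
#define VEC __m256i
#define VEC_LANES 8
#define VEC_LOADU(p) _mm256_loadu_si256((const __m256i *)(p))
#define VEC_STOREU(p, v) _mm256_storeu_si256((__m256i *)(p), (v))
#define VEC_ADD(a, b) _mm256_add_epi32((a), (b))
#define VEC_SAR(a, k) _mm256_srai_epi32((a), (k))
#elif defined(__SSE2__)
#include <emmintrin.h>
#define VEC __m128i
#define VEC_LANES 4
#define VEC_LOADU(p) _mm_loadu_si128((const __m128i *)(p))
#define VEC_STOREU(p, v) _mm_storeu_si128((__m128i *)(p), (v))
#define VEC_ADD(a, b) _mm_add_epi32((a), (b))
#define VEC_SAR(a, k) _mm_srai_epi32((a), (k))
#endif

/* Odd-row lifting step for one row, across all columns:
 *   odd[c] = high[c] + floor((even_above[c] + even_below[c]) / 2) */
void idwt53_predict_row(const int32_t *even_above, const int32_t *even_below,
                        const int32_t *high, int32_t *odd, size_t width)
{
    size_t c = 0;
    assert(even_above && even_below && high && odd);
#ifdef VEC_LANES
    /* Vector body: VEC_LANES columns per iteration. */
    for (; c + VEC_LANES <= width; c += VEC_LANES) {
        VEC sum = VEC_ADD(VEC_LOADU(even_above + c), VEC_LOADU(even_below + c));
        /* arithmetic shift right by 1 == floor division by 2 */
        VEC_STOREU(odd + c, VEC_ADD(VEC_LOADU(high + c), VEC_SAR(sum, 1)));
    }
#endif
    /* Scalar tail (or the whole row when no SIMD is available). */
    for (; c < width; c++) {
        int32_t sum = even_above[c] + even_below[c];
        odd[c] = high[c] + (sum >= 0 ? sum / 2 : -((1 - sum) / 2));
    }
}
```

With this layout, selecting AVX2 only requires a compiler flag such as `-mavx2` on GCC or Clang; the kernel itself does not change.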
…able tests since Travis doesn't have AVX2 compatible machines)

rouault commented Jun 21, 2017

Note: the failure in AppVeyor is a network flake. The build passes on the same commit pushed to my account: https://ci.appveyor.com/project/rouault/openjpeg/build/2.1.1.15


rouault commented Jun 26, 2017

opj_decompress timings on 8c05f00a-ae05-4dd5-bdc7-a1b5eed4ebfb.jp2 from testovani (15595 wide x 11128 tall x 3 components):

idwt_53_improvements branch, SSE2: 48.698 s
idwt_53_improvements branch, AVX2: 48.050 s
master branch, SSE2: 55.759 s
master branch, AVX2: 55.294 s

So total decoding time drops by 12.6% (7.061 s) from master to the idwt_53_improvements branch with SSE2, with a further 1.3% decrease from SSE2 to AVX2 on the idwt_53_improvements branch.
Note: the SSE2->AVX2 improvement here combines the gain from recompiling the whole code base with AVX2 (55.759 - 55.294 = 0.465 s, i.e. 465 ms) and a specific improvement from the IDWT 5x3 AVX2 optimization (48.698 - 48.050 - 0.465 = 0.183 s, i.e. 183 ms).
