IDWT 5x3 single-pass lifting and SSE2/AVX2 implementation by rouault · Pull Request #957 · uclouvain/openjpeg

rouault · 2017-06-21T11:36:54Z

Implements #953

With the new bench_dwt utility, on x86_64:

before changes: 3.356 s
with SSE2 optimization (default for a x86_64): 0.992 s
with AVX2 optimization (requested at compilation time): 0.744 s

SSE2/AVX2 is used in the vertical pass to handle several columns at the same time, which avoids a lot of CPU cache thrashing.
Note: I tried an SSE2-optimized version of opj_idwt53_h_cas0(), but the gain was almost unnoticeable, so it is not included in this PR.
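To illustrate what "several columns at the same time" means, here is a minimal sketch of one vertical lifting step applied to a whole row with SSE2 intrinsics. It is not the code from this PR: the function name `idwt53_v_update_row` and its row-pointer layout are assumptions made for the example, and it handles 4 columns per iteration (one 128-bit register) for brevity. A scalar fallback is used when `__SSE2__` is not defined.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Floor division by 2^s that does not rely on the behaviour of >> for
 * negative values (assumes v is not INT32_MIN). */
static int32_t floor_shift(int32_t v, int s)
{
    return v >= 0 ? v >> s : -((-v + (1 << s) - 1) >> s);
}

/* Hypothetical helper: inverse 5x3 "update" step for one output row.
 * For every column c:
 *   out[c] = s[c] - floor((d_prev[c] + d_cur[c] + 2) / 4)
 * s is a low-pass row, d_prev / d_cur are the neighbouring high-pass rows. */
static void idwt53_v_update_row(const int32_t *s,
                                const int32_t *d_prev,
                                const int32_t *d_cur,
                                int32_t *out, size_t ncols)
{
    size_t c = 0;
#ifdef __SSE2__
    const __m128i two = _mm_set1_epi32(2);
    /* 4 adjacent columns per iteration: contiguous loads, no strided access. */
    for (; c + 4 <= ncols; c += 4) {
        __m128i vs = _mm_loadu_si128((const __m128i *)(s + c));
        __m128i vp = _mm_loadu_si128((const __m128i *)(d_prev + c));
        __m128i vc = _mm_loadu_si128((const __m128i *)(d_cur + c));
        __m128i sum = _mm_add_epi32(_mm_add_epi32(vp, vc), two);
        /* _mm_srai_epi32 is an arithmetic shift, i.e. a floor division by 4. */
        __m128i res = _mm_sub_epi32(vs, _mm_srai_epi32(sum, 2));
        _mm_storeu_si128((__m128i *)(out + c), res);
    }
#endif
    /* Scalar tail (or the whole row when SSE2 is unavailable). */
    for (; c < ncols; c++) {
        out[c] = s[c] - floor_shift(d_prev[c] + d_cur[c] + 2, 2);
    }
}
```

Because each iteration reads consecutive memory from three rows, a cache line is consumed fully before moving on, instead of touching one element per line as a column-by-column traversal would.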

* Use single-pass lifting inverse wavelet transform.
* For the vertical pass, use SSE2 when available so as to process 8 columns in parallel. This is the most beneficial improvement, since the vertical pass involves a lot of cache thrashing. With the bench_dwt utility with default arguments (16383x16383 image), time goes from 4.064 s to 1.212 s.
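As a rough illustration of the single-pass idea, the scalar sketch below compares a conventional two-pass inverse 5x3 lifting (all even samples first, then all odd samples) with a single loop that produces each even/odd pair as soon as its inputs are known. This is not the code from dwt.c: the function names, the separate `low` / `high` input arrays, the even-start (cas 0) layout and the `len >= 2` restriction are all assumptions made for the example.

```c
#include <stdint.h>

/* Floor division by 2^s, independent of how >> treats negative values. */
static int32_t floor_shift(int32_t v, int s)
{
    return v >= 0 ? v >> s : -((-v + (1 << s) - 1) >> s);
}

/* Reference: two passes over the output. Assumes len >= 2.
 * low has (len + 1) / 2 samples, high has len / 2 samples. */
static void idwt53_two_pass(const int32_t *low, const int32_t *high,
                            int32_t *out, int len)
{
    const int sn = (len + 1) / 2;
    const int dn = len / 2;
    /* Pass 1: even samples, with symmetric extension of high[] at both ends. */
    for (int i = 0; i < sn; i++) {
        const int32_t d_prev = high[i > 0 ? i - 1 : 0];
        const int32_t d_cur = high[i < dn ? i : dn - 1];
        out[2 * i] = low[i] - floor_shift(d_prev + d_cur + 2, 2);
    }
    /* Pass 2: odd samples, mirroring the last even sample if needed. */
    for (int i = 0; i < dn; i++) {
        const int32_t left = out[2 * i];
        const int32_t right = (2 * i + 2 < len) ? out[2 * i + 2] : left;
        out[2 * i + 1] = high[i] + floor_shift(left + right, 1);
    }
}

/* Single pass: each iteration computes the next even sample, then the odd
 * sample between the current and next even samples. Assumes len >= 2. */
static void idwt53_single_pass(const int32_t *low, const int32_t *high,
                               int32_t *out, int len)
{
    const int sn = (len + 1) / 2;
    const int dn = len / 2;
    int32_t even_cur = low[0] - floor_shift(high[0] + high[0] + 2, 2);
    for (int i = 0; i < dn; i++) {
        int32_t even_next;
        if (i + 1 < sn) {
            const int32_t d_next = high[i + 1 < dn ? i + 1 : dn - 1];
            even_next = low[i + 1] - floor_shift(high[i] + d_next + 2, 2);
        } else {
            even_next = even_cur; /* mirror at the right boundary */
        }
        out[2 * i] = even_cur;
        out[2 * i + 1] = high[i] + floor_shift(even_cur + even_next, 1);
        even_cur = even_next;
    }
    if (sn > dn) {
        out[len - 1] = even_cur; /* odd length: trailing even sample */
    }
}
```

Both functions should produce identical output. The single-pass variant writes each output sample once and never re-reads the output buffer, which is the property that matters for memory traffic.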

Thanks to our macros that abstract SSE use, the functions can use AVX2 when available (at compile time). This brings an extra 23% speed improvement on bench_dwt in 64-bit builds with AVX2 compared to SSE2.
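The macro abstraction can be sketched as follows. The macro names (`VREG`, `VREG_INT_COUNT`, `LOADU`, `STOREU`, `ADD`, `SAR`) and the function `idwt53_v_predict_row` are chosen for this example and should be treated as assumptions, not as the exact definitions in dwt.c. The point is that the loop body is written once and the register width is selected at compile time.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>
#define VREG            __m256i
#define VREG_INT_COUNT  8
#define LOADU(p)        _mm256_loadu_si256((const __m256i *)(p))
#define STOREU(p, v)    _mm256_storeu_si256((__m256i *)(p), (v))
#define ADD(a, b)       _mm256_add_epi32((a), (b))
#define SAR(a, n)       _mm256_srai_epi32((a), (n))
#elif defined(__SSE2__)
#include <emmintrin.h>
#define VREG            __m128i
#define VREG_INT_COUNT  4
#define LOADU(p)        _mm_loadu_si128((const __m128i *)(p))
#define STOREU(p, v)    _mm_storeu_si128((__m128i *)(p), (v))
#define ADD(a, b)       _mm_add_epi32((a), (b))
#define SAR(a, n)       _mm_srai_epi32((a), (n))
#endif

/* Floor division by 2^s for the scalar tail. */
static int32_t floor_shift(int32_t v, int s)
{
    return v >= 0 ? v >> s : -((-v + (1 << s) - 1) >> s);
}

/* Hypothetical helper: inverse 5x3 "predict" step for one output row.
 *   out[c] = d[c] + floor((above[c] + below[c]) / 2)
 * The vector loop is identical for SSE2 and AVX2; only the macros differ. */
static void idwt53_v_predict_row(const int32_t *d,
                                 const int32_t *above,
                                 const int32_t *below,
                                 int32_t *out, size_t ncols)
{
    size_t c = 0;
#ifdef VREG
    for (; c + VREG_INT_COUNT <= ncols; c += VREG_INT_COUNT) {
        VREG sum = ADD(LOADU(above + c), LOADU(below + c));
        STOREU(out + c, ADD(LOADU(d + c), SAR(sum, 1)));
    }
#endif
    /* Scalar tail, or the whole row when no SIMD macros are defined. */
    for (; c < ncols; c++) {
        out[c] = d[c] + floor_shift(above[c] + below[c], 1);
    }
}
```

Compiling the same file with AVX2 enabled (for example `-mavx2` with GCC or Clang) switches the loop to 8 columns per iteration without touching the function body.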

rouault · 2017-06-21T12:24:02Z

Note: the failure in AppVeyor is a network flake. Passes on the same commit pushed to my account: https://ci.appveyor.com/project/rouault/openjpeg/build/2.1.1.15

rouault · 2017-06-26T10:45:20Z

opj_decompress timing results on 8c05f00a-ae05-4dd5-bdc7-a1b5eed4ebfb.jp2 from testovani (15595 wide x 11128 tall x 3 components):

idwt_53_improvements branch, SSE2 : 48.698s
idwt_53_improvements branch, AVX2 : 48.050s
master branch, SSE2: 55.759s
master branch, AVX2: 55.294s

So a global decrease of 12.6% (7.061 s) from master to the idwt_53_improvements branch with SSE2, and an extra decrease of 1.3% from SSE2 to AVX2 in the idwt_53_improvements branch.
Note: the SSE2->AVX2 improvement here is composed of a gain from recompiling the whole code base with AVX2 (55.759 - 55.294 = 0.465 s = 465 ms) plus a specific improvement due to the IDWT 5x3 AVX2 optimization (48.698 - 48.050 - 0.465 = 0.183 s = 183 ms).

rouault added 6 commits June 20, 2017 17:56

Add bench_dwt program (compiled only if BUILD_BENCH_DWT=ON)

919ed5f

Enable __SSE__ / __SSE2__ with Visual Studio

f06cfad

dwt.c: small cleanup

f6e3475

IDWT 5x3: generalize SSE2 version for AVX2

fd0dc53

Thanks to our macros that abstract SSE use, the functions can use AVX2 when available (at compile time). This brings an extra 23% speed improvement on bench_dwt in 64-bit builds with AVX2 compared to SSE2.

.travis.yml: add a configuration to test compilation of AVX2 (but disable tests since Travis doesn't have AVX2 compatible machines)

4fe7620
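For context on the "Enable __SSE__ / __SSE2__ with Visual Studio" commit listed above: MSVC does not predefine `__SSE2__`, so code guarded by that macro is skipped unless the build detects SSE2 some other way. The sketch below shows one way to do that using MSVC's documented `_M_X64` and `_M_IX86_FP` macros. The macro `SKETCH_HAVE_SSE2` and the function `sketch_have_sse2` are hypothetical names for this example; the actual commit may be structured differently.

```c
/* Hypothetical project-local feature macro, set at compile time. */
#if defined(__SSE2__)
/* GCC / Clang define __SSE2__ when SSE2 code generation is enabled. */
#define SKETCH_HAVE_SSE2 1
#elif defined(_MSC_VER) && \
      (defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2))
/* MSVC: x64 targets always have SSE2; on 32-bit x86, _M_IX86_FP is 2
 * when compiling with /arch:SSE2 or higher. */
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Returns 1 if this translation unit was compiled with SSE2 available. */
static int sketch_have_sse2(void)
{
    return SKETCH_HAVE_SSE2;
}
```

SIMD code paths can then be guarded with `#if SKETCH_HAVE_SSE2` regardless of the compiler in use.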

rouault requested review from CharlesBuysschaertIntopix and detonin June 21, 2017 11:36

rouault merged commit 533fa2f into uclouvain:master Jun 26, 2017

This was referenced Jun 26, 2017

Port single-pass & SSE2/AVX2 optimizations of IDWT 5x3 to forward DWT 5x3 (compression) #959

Open

SSE2 optimization for horizontal pass of IDWT 5x3 #960

Open

Dynamic switch at runtime between SSE2 and AVX2 optim of IDWT 5x3 #961

Open

rouault added a commit that referenced this pull request Jun 29, 2017

IDWT 5x3: fix bug in AVX2 implementation (#953, #957)

8fa405e

detonin added the enhancement label Aug 3, 2017

rouault merged 6 commits into uclouvain:master from rouault:idwt_53_improvements
