
Add SSE2 dec_loop.c #46

Closed · wants to merge 1 commit

Conversation

@aqrit commented Nov 25, 2017

Use saturation for range check.

This commit is wholly dead code. In the future, it would be useful anywhere the PSHUFB instruction is slow (#44) or not supported.
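For readers who haven't opened the diff, the saturation trick works roughly like this (a standalone sketch of the technique, not this PR's actual code): with unsigned saturating subtraction, (x -sat hi) is nonzero iff x > hi, and (lo -sat x) is nonzero iff x < lo, so a byte-range check needs only SSE2 and no PSHUFB.

#include <emmintrin.h>
#include <stdio.h>
#include <string.h>

// Return nonzero if all 16 bytes of x lie in [lo, hi].
static int all_in_range(__m128i x, unsigned char lo, unsigned char hi)
{
    __m128i above = _mm_subs_epu8(x, _mm_set1_epi8((char) hi)); // nonzero where x > hi
    __m128i below = _mm_subs_epu8(_mm_set1_epi8((char) lo), x); // nonzero where x < lo
    __m128i bad   = _mm_or_si128(above, below);                 // all-zero iff all bytes in range
    return _mm_movemask_epi8(_mm_cmpeq_epi8(bad, _mm_setzero_si128())) == 0xFFFF;
}

int main(void)
{
    char buf[16];
    memcpy(buf, "ABCDEFGHIJKLMNOP", 16);
    __m128i v = _mm_loadu_si128((const __m128i *) buf);
    printf("%d\n", all_in_range(v, 'A', 'Z')); // prints 1
    return 0;
}

The real decoder has to check several ranges at once, but this is the primitive it builds on.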

@aqrit (Author) commented Nov 25, 2017

I'm unsure about the best way to hook it up.

I would advocate for a minimalist CPU detection library (#13), maybe the one from the x265 project?

@aklomp (Owner) commented Nov 12, 2019

What's interesting is that SSE2 is a mandatory part of the AMD64 architecture, so we could use this codec as a fallback instead of the plain decoder when we detect at compile time that we're compiling for AMD64.

As for CPU detection, I want to tackle this in the future, but I was thinking about using GCC's __builtin_cpu family of functions instead of maintaining something myself.
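A minimal sketch of how the two ideas could combine (the macro and function names are illustrative, not this library's actual API; the builtins are GCC/Clang-specific): SSE2 can be assumed unconditionally when targeting AMD64 at compile time, while finer-grained choices use the builtins at runtime.

#include <stdio.h>

// SSE2 is part of the AMD64 baseline, so an SSE2 fallback can be
// selected at compile time whenever that architecture is targeted.
#if defined(__x86_64__) || defined(_M_X64)
#define HAVE_SSE2_BASELINE 1
#else
#define HAVE_SSE2_BASELINE 0
#endif

static const char *choose_codec(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("ssse3"))
        return "ssse3";
    if (HAVE_SSE2_BASELINE || __builtin_cpu_supports("sse2"))
        return "sse2";
    return "plain";
}

int main(void)
{
    printf("codec: %s\n", choose_codec());
    return 0;
}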

On the diff in dec_loop.c:

}

// Pack 16 bytes into the 12-byte output format and store back:
_mm_storeu_si128((__m128i *)o, dec_reshuffle(out));
@aklomp (Owner):
The dec_reshuffle function appears to be undefined.

@aqrit (Author):

This was meant for Atom CPUs, so the SSSE3 dec_reshuffle() would have been used.

I tried to do an SSE2 packer here, but it is not too great.
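For context, the SSSE3 reshuffle step being discussed looks roughly like the version from Wojciech Muła's base64 writeups (a sketch, not necessarily this repository's exact code): two multiply-add steps merge the four 6-bit fields of each 32-bit lane, and a final PSHUFB packs the three significant bytes of each lane together.

#include <tmmintrin.h>

static __m128i dec_reshuffle(__m128i in)
{
    // Within each 16-bit lane: ab = a * 64 + b (merge pairs of 6-bit fields).
    const __m128i merged = _mm_maddubs_epi16(in, _mm_set1_epi32(0x01400140));
    // Within each 32-bit lane: abcd = ab * 4096 + cd (merge the two halves).
    const __m128i packed = _mm_madd_epi16(merged, _mm_set1_epi32(0x00011000));
    // Gather the 3 significant bytes of each lane into the low 12 bytes.
    return _mm_shuffle_epi8(packed, _mm_setr_epi8(
        2, 1, 0, 6, 5, 4, 10, 9, 8, 14, 13, 12, -1, -1, -1, -1));
}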

@aklomp (Owner):

I tried to get the code working with the SSSE3 decoder to get an idea of its performance, but hit a small bug: the function call should be dec_reshuffle(m0).

If the assumption is that this code would be used in conjunction with the SSSE3 reshuffler, then why not use the full SSSE3 codec?

One possible answer comes when testing this on a 32-bit Atom N270. There is a large improvement over the stock SSSE3 decoder: 591 MB/s for the SSE2 code against 427 MB/s for the SSSE3 code.

It's significantly slower on my desktop i5-4590S though: 6150 MB/s against 6800 MB/s for the stock SSSE3 codec. This makes it hard to justify merging this code for common use.

Maybe GCC's __builtin_cpu_is("atom") builtin would be useful here...
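A hedged sketch of that idea: GCC does accept "atom" (the Bonnell microarchitecture) as an argument to __builtin_cpu_is, so the dispatch could look like this (the codec choice strings are illustrative).

#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();
    // "atom" matches the Bonnell microarchitecture in GCC's CPU names.
    if (__builtin_cpu_is("atom"))
        puts("prefer the SSE2 loop: PSHUFB is slow here");
    else
        puts("prefer the SSSE3 loop");
    return 0;
}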

@htot (Contributor) commented Nov 15, 2019

> As for CPU detection, I want to tackle this in the future, but I was thinking about using GCC's __builtin_cpu family of functions instead of maintaining something myself.

We might use CPU detection to tweak the algorithm for performance as well; see the benchmarks.

@aqrit (Author) commented Nov 15, 2019

I wonder if the new packing code is slower on old (read: Atom) hardware as well. It trades logic/shift operations for multiplies.

Does someone want to try it with the old dec_reshuffle()?

Also, Atom should probably be using aligned loads.

Edit: this is being built 64-bit, right?

I assume people have seen the new competition?
https://github.com/powturbo/TurboBase64/blob/master/index.md
https://github.com/WojciechMula/base64-avx512

@aklomp (Owner) commented Nov 15, 2019

I did see the new breakthroughs by @WojciechMula and @lemire; they're what piqued my interest in working on this library again. Their AVX512 codec is impressive work. I'd like to add it to this library some day, but the problem is that it currently requires exotic hardware to test and verify the code. So it's something for the longer term.

I did see @powturbo's code. While their plain codec is indeed faster than our plain codec, their other claims don't really hold water. I did the benchmarks myself and SIMD still wins in the end. Notice that our NEON codecs are mysteriously missing from their benchmarks, and the AVX2 codecs only get a mention but no numbers.

@aqrit (Author) commented Nov 15, 2019

This patch needs tweaking for efficient operation with only 8 XMM registers (i.e. in 32-bit builds).

@htot (Contributor) commented Nov 16, 2019

@powturbo's benchmarks say our decoding is slower than our encoding, the opposite of our own benchmarks. Somewhere something must be wrong.

@aklomp (Owner) commented Nov 16, 2019

I pushed a quick-and-dirty branch called turbobase64 which integrates the TurboBase64 codec into our benchmarks. You can build it yourself or see some results in the CI here. I didn't hook their codec up to our test suite yet.

The result: their plain codec is faster than ours (assuming it also does all the necessary validity checks), but it is no match for the SIMD codecs by a long shot. Don't believe everything you read on the Internet.

Agreed about the decoding anomaly: our decoder should be faster than the encoder. I'd like to see a reproducible process for generating those benchmarks.

@powturbo commented Nov 17, 2019

Well, TurboBase64 is a small experiment that has existed since 2016. It's strange that it hasn't gotten any attention since then. The fastbase64 people also tested my library and didn't complain about the results. Nobody else has included TurboBase64 in benchmarks since then.

I primarily ran the benchmarks against fastbase64, which included other Base64 functions from Chromium, Apple QuickTime, Linux, ... In the end, I included the SSE benchmark from your library. Recently, I adapted TurboBase64 to ARM and ran some benchmarks against the NEON implementation in fastbase64. I have not redone all the other benchmarks (same CPU). As your library seemed unmaintained for a long time, I'm not aware of any improvements made since. At the time of benchmarking, the results were 100% correct: scalar was effectively faster than the other scalar libraries, and decoding was even faster than SSE with the hardware and compilers of that time. You are invited to verify the benchmark under the same conditions.

The goal is not to offer a library that is faster than everything, but to show that scalar can also be improved enough to be competitive with, or possibly faster than, other SIMD implementations. I could simply include the famous SIMD work of Wojciech Muła if I wanted to add other SSE and AVX2 functions to TurboBase64. Detecting the CPU in use and switching between functions is trivial work.

Edit: Your CI results show TurboBase64 decoding 2x faster than Base64's SIMD NEON codec.

@aklomp (Owner) commented Nov 17, 2019

Hi @powturbo, nice to see you've joined the discussion! You've done some impressive work on your library. Your plain codec is a speed monster, and I'm a bit envious of it actually. I just think you're overselling its speed a bit when compared to codecs that use a SIMD approach. It's like saying you have the fastest bicycle in a world where motorcycles exist. It's still a great achievement, and it has the potential to become a very portable, "fast everywhere" library.

I was unaware of your codec until your comments on HN some weeks ago. Then again, as you point out, I haven't been very active in the Base64 field since 2016. Many other hobbies to explore and lives to lead, etc. I've decided to start working on this library again because I left a bunch of loose ends a few years ago and I want to clean them up as best I can, so that I can hopefully "complete" this library and move on in good conscience.

I did notice that your decoder does not check the validity of the input. Not every string is a valid Base64 string. That's perhaps fine for the library's intended use cases, but it's also a shortcut that makes it easier to create a fast table-based algorithm. I was unable to hook up your decoder to our test suite because it does not output such an "input is bad" flag.

As for the issue of benchmarking, it's interesting that your benchmark concludes that our decoding is slower than our encoding. Normally it's the other way around, and when looking at the code it should be obvious why. I'd like to see why this is, and rerun the benchmarks, but I can't seem to find your benchmarking code.

You're right about the situation with Neon64. I noticed that myself when I added cross-platform CI builds a few days ago. (I don't own a Neon64-capable device that I can test on.) I'd like to know what hardware Travis CI uses for their Aarch64 machines. It's possible that it's not even real ARM hardware, but emulation, in which case the results will naturally be poor. Note that the Neon32 benchmarks are for sure run through Qemu (by our CI script), so they are not indicative of actual performance.

If genuine, the Neon64 numbers are worrying, because those codecs apparently have very good performance on the iPhone. Maybe the performance hit is due to poor code generation by the compiler, or something else like register pressure. I'm considering rewriting parts of the codec in inline assembler to see if it makes a difference. I'm also considering buying a cheap Aarch64 board so that I have something to test on. Until then, your library looks to be much faster on that platform.

@htot (Contributor) commented Nov 17, 2019

> I'd like to know what hardware Travis CI uses for their Aarch64 machines.

lxd-arm64 → the build ran within an LXD container on Arm64-based infrastructure (currently delivered by Packet), according to the docs.

@mayeut (Contributor) commented Nov 17, 2019

@aklomp, the poor NEON performance is probably due to the architectural differences between an iPhone (very good at lookups and vldX/vstX in NEON) and whatever travis-ci uses (I don't know whether they use ThunderX or Ampere chips, but I know that ThunderX is bad at those operations). libjpeg-turbo uses different implementations depending on the actual architecture.

@powturbo commented Nov 19, 2019

Hi @aklomp,

Thank you.

I'm only interpreting the benchmark numbers, and other SSE Base64 implementations were effectively not that fast at the time of benchmarking. Scalar was faster, but you may get other results depending on the hardware used, because the difference was not that big (see benchmarks). I've also benchmarked base64simd, and the results were similar.

It's not unusual for scalar to beat (unoptimized) SSE and even AVX2 (see e.g. https://github.com/powturbo/TurboHist).

As stated in the readme, the benchmarks are done with turbobench:
1. Clone turbobench.
2. Download base64, turbobase64 and fastbase64 into the turbobench directory.
3. Compile the base64 library with the provided base64 makefile (output: libbase64.o).
4. Type "make BASE64=1" in the turbobench directory.
5. Type "./turbobench -eturbob64/b64_ssse3/b64_sse41 file -I15 -J15".

When you use the base64 version that was benchmarked (from the beginning of 2017, I think), you should see that decoding is a lot slower than encoding for the base64 SSE codec. You can also verify this with your own benchmark tools.

Travis is using (Ampere eMAG) Packet ARM servers; see the review.

@powturbo commented
Hi,

I've now benchmarked base64 using the latest version, including AVX2. In my benchmarks, with the CPU used, SSE encoding is still faster than decoding.

@htot (Contributor) commented Nov 21, 2019

No doubt there is a difference in the measurement method. We already found that the results depend on buffer size, threads used, CPU type, and memory speed. I imagine that with your benchmark the effect of the cache is eliminated (large buffer, random data). This might be comparable to the Edison, which has no cache at all. And indeed, on the Edison, SSSE3 encoding is faster than decoding; see (my slightly outdated) benchmarks. If so, turning on multi-threading will speed up the plain codec, but not change SSSE3 and AVX2.

@mayeut (Contributor) commented Nov 21, 2019

> No doubt there is a difference in the measurement method.

Sure: the base64 benchmark uses the input size for both functions (decoded/raw byte count for encoding, encoded byte count for decoding), while the turbobase64 bench seems to always use the decoded/raw byte count.
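To see how much that convention matters (illustrative numbers, not measured ones): Base64 expands data by a factor of 4/3, so a decoder that consumes 4 GB of encoded input and produces 3 GB of raw output in one second is reported as 4 GB/s under the first convention but only 3 GB/s under the second. That constant 33% gap is enough to flip an encode-vs-decode comparison.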

@aklomp (Owner) commented Nov 21, 2019

@powturbo Thanks for doing the benchmarks. The difference between the encoder and decoder is more in line now with what I would expect. If you ran the tests on a fresh checkout, it may be because of some codec improvements.

I'm still impressed by the speed of your plain codec. I've been playing with my own 12-bit lookup table code over the past few days and found it impossible to beat yours, even when I pull out all the tricks like loop unrolling. I'll push that code soon because halving the number of table lookups makes a massive performance difference.
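For the curious, the 12-bit table idea sketched out (an illustration of the technique, not the code being pushed): a 4096-entry table maps every 12-bit value to the two Base64 characters that encode it, so each 24-bit input group costs two lookups instead of four.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

static const char b64chars[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

static uint16_t lut[4096];

static void lut_init(void)
{
    for (int i = 0; i < 4096; i++) {
        // First character (high 6 bits) in the low byte, second in the high byte.
        lut[i] = (uint16_t) ((uint8_t) b64chars[i >> 6]
               | (uint16_t) b64chars[i & 0x3F] << 8);
    }
}

// Encode whole 3-byte groups; tail and padding handling omitted.
static void encode_groups(const uint8_t *src, size_t groups, char *dst)
{
    for (size_t i = 0; i < groups; i++, src += 3, dst += 4) {
        uint32_t n = (uint32_t) src[0] << 16 | (uint32_t) src[1] << 8 | src[2];
        uint16_t hi = lut[n >> 12];   // one lookup yields two output chars
        uint16_t lo = lut[n & 0xFFF];
        dst[0] = (char) (hi & 0xFF);
        dst[1] = (char) (hi >> 8);
        dst[2] = (char) (lo & 0xFF);
        dst[3] = (char) (lo >> 8);
    }
}

int main(void)
{
    char out[9] = { 0 };
    lut_init();
    encode_groups((const uint8_t *) "Man is", 2, out);
    printf("%s\n", out); // TWFuIGlz
    return 0;
}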

The benchmarking in this library was always a bit of an afterthought. The benchmarking program was quickly thrown together without much regard for eliminating sources of bias, or much thought about what to measure to get meaningful and repeatable results. It should probably be on the longlist of things that need work. The variance in the results can be frustrating: sometimes it's hard to tell whether a change is really an improvement or a regression.

Codec performance varies enormously across processors, that's for sure. Code that is fast on one architecture is slow on another, as demonstrated most clearly by the NEON64 code. I have plans to add a new optional codec_choose backend which uses GCC's __builtin_cpu runtime detection to determine x86 architecture at runtime. That should make it possible to have fine-grained control over which codec runs where. (And this closes the circle with the topic of this PR :)

(In other news, I just released v0.4.0. It's a release of the status quo, so that existing users can anchor to this version while we have the freedom to make potentially breaking changes.)

@aqrit mentioned this pull request on Jun 23, 2022
@aqrit closed this on Jun 27, 2022