Add SSE2 dec_loop.c #46
Use saturation for range check. This commit is wholly dead code, but it would be useful in the future anywhere the PSHUFB instruction is slow or not supported.
I'm unsure about the best way to hook it up. I would advocate for a minimalist CPU detection library (#13).
What's interesting is that SSE2 is a mandatory part of the AMD64 architecture, so we could use this codec as a fallback instead of the plain decoder when we detect at compile time that we're compiling for AMD64. As for CPU detection, I want to tackle this in the future, but I was thinking about using GCC's __builtin_cpu family of functions instead of maintaining something myself.
// Pack 16 bytes into the 12-byte output format and store back:
_mm_storeu_si128((__m128i *)o, dec_reshuffle(out));
The dec_reshuffle function appears to be undefined.
This was meant for Atom CPUs, so the SSSE3 dec_reshuffle() would have been used. I tried to write an SSE2 packer here, but it is not too great.
I tried to get the code working with the SSSE3 decoder to get an idea of its performance, but hit a small bug: the function call should be dec_reshuffle(m0).
If the assumption is that this code would be used in conjunction with the SSSE3 reshuffler, then why not use the full SSSE3 codec?
One possible answer comes when testing this on a 32-bit Atom N270. There is a large improvement over the stock SSSE3 decoder: 591 MB/s for the SSE2 code against 427 MB/s for the SSSE3 code.
It's significantly slower on my desktop i5-4590S though: 6150 MB/s against 6800 MB/s for the stock SSSE3 codec. This makes it hard to justify merging this code for common use.
Maybe GCC's __builtin_cpu_is("atom") builtin would be useful here...
We might use CPU detection to tweak the algorithm for performance as well; see the benchmarks.
Does someone want to try it with the old one? Also, Atom should probably be using aligned loads. Edit: this is being built 64-bit, right? I assume people have seen the new competition?
I did see the new breakthroughs by @WojciechMula and @lemire; they're what piqued my interest in working on this library again. Their AVX512 codec is impressive work. I'd like to add it to this library some day, but the problem is that it currently requires exotic hardware to test and verify the code. So it's something for the longer term.

I did see @powturbo's code. While their plain codec is indeed faster than our plain codec, their other claims don't really hold water. I did the benchmarks myself, and SIMD still wins in the end. Notice that our NEON codecs are mysteriously missing from their benchmarks, and the AVX2 codecs only get a mention but no numbers.
This patch needs tweaking for efficient operation with only 8 XMM registers.
@powturbo's benchmarks say our decoding is slower than our encoding, the opposite of our own benchmarks. Somewhere something must be wrong.
I pushed a quick-and-dirty branch called turbobase64 which integrates the TurboBase64 codec into our benchmarks. You can build it yourself or see some results in the CI here. I didn't hook their codec up to our test suite yet.

The result: their plain codec is faster than ours (if it also does all necessary validity checks), but is no match for the SIMD codecs by a long shot. Don't believe everything you read on the Internet.

Agreed about the decoding anomaly; our decoder should be faster than the encoder. I'd like to see a reproducible process to generate those benchmarks.
Well, TurboBase64 is a small experiment that has existed since 2016. I've primarily done the benchmarks against fastbase64, which included …

Edit: Your CI results are showing that turbobase64 is decoding 2x faster than Base64 SIMD NEON.
Hi @powturbo, nice to see you've joined the discussion! You've done some impressive work on your library. Your plain codec is a speed monster, and I'm a bit envious of it, actually. I just think you're overselling its speed a bit when compared to codecs that use a SIMD approach. It's like saying you have the fastest bicycle in a world where motorcycles exist. It's still a great achievement, and it has the potential to become a very portable, "fast everywhere" library.

I was unaware of your codec until your comments on HN some weeks ago. Then again, as you point out, I haven't been very active in the Base64 field since 2016. Many other hobbies to explore and lives to lead, etc. I've decided to start working on this library again because I left a bunch of loose ends a few years ago and I want to clean them up as best I can, so that I can hopefully "complete" this library and move on in good conscience.

I did notice that your decoder does not check the validity of the input. Not every string is a valid Base64 string. That's perhaps fine for the library's intended use cases, but it's also a shortcut that makes it easier to create a fast table-based algorithm. I was unable to hook up your decoder to our test suite because it does not output such an "input is bad" flag.

As for the issue of benchmarking, it's interesting that your benchmark concludes that our decoding is slower than our encoding. Normally it's the other way around, and when looking at the code it should be obvious why. I'd like to see why this is and rerun the benchmarks, but I can't seem to find your benchmarking code.

You're right about the situation with Neon64. I noticed that myself when I added cross-platform CI builds a few days ago. (I don't own a Neon64-capable device that I can test on.) I'd like to know what hardware Travis CI uses for their Aarch64 machines. It's possible that it's not even real ARM hardware, but emulation, in which case the results will naturally be poor.
Note that the Neon32 benchmarks are for sure run through Qemu (by our CI script), so they are not indicative of actual performance. If true, the Neon64 numbers are worrying, because those codecs apparently have very good performance on the iPhone. Maybe the performance hit is due to poor code generation by the compiler, or something else like register pressure. I'm considering rewriting parts of the codec in inline assembler to see if it makes a difference. I'm also considering buying a cheap Aarch64 board so that I have something to test on. Until that time, your library looks to be much faster on that platform.
lxd-arm64 → the build ran within an LXD container on Arm64-based infrastructure (currently delivered by Packet), according to the docs.
@aklomp, the poor performance in NEON is probably due to the differences in architecture between an iPhone (very good at lookups and vldX/vstX in NEON) and the one used by Travis CI (I don't know whether they use ThunderX or Ampere chips, but I know that ThunderX is bad for those operations). libjpeg-turbo uses different implementations depending on the actual architecture.
Hi @aklomp, thank you. I'm only interpreting the benchmark numbers and effectively other SSE base64 … It's not unusual to have scalar beating (unoptimized) SSE and even AVX2. As stated in the readme, the benchmarks are done with turbobench. Travis is using (Ampere eMAG) Packet ARM servers.
Hi,
No doubt there is a difference in the measurement method. We already found there are bounds depending on buffer size, threads used, CPU type, and memory speed. I imagine that with your benchmark the effect of the cache is eliminated (large buffer, random data). This might be comparable to Edison, which has no cache at all. And indeed, with Edison, SSSE3 encoding is faster than decoding. See (my slightly outdated) benchmarks. If so, turning on multi-threading will speed up plain, but not change SSSE3 and AVX2.
Sure: the base64 benchmark uses the input size for both functions (decoded/raw byte count for encode, encoded byte count for decode), while the turbobase64 bench seems to always use the decoded/raw byte count.
@powturbo Thanks for doing the benchmarks. The difference between the encoder and decoder is more in line now with what I would expect. If you ran the tests on a fresh checkout, it may be because of some codec improvements.

I'm still impressed by the speed of your plain codec. I've been playing with my own 12-bit lookup table code over the past few days and found it impossible to beat yours, even when I pull out all the tricks like loop unrolling. I'll push that code soon, because halving the number of table lookups makes a massive performance difference.

The benchmarking in this library was always a bit of an afterthought. The benchmarking program was quickly thrown together, without much regard for eliminating forms of bias or much thought about what to measure to get meaningful and repeatable results. It should probably be on the longlist of things that need work. The variance in the results can be frustrating; sometimes it's hard to tell whether a change is really an improvement or a regression.

Codec performance varies enormously across processors, that's for sure. Code that is fast on one architecture is slow on another, as demonstrated most clearly by the NEON64 code. I have plans to add a new optional …

(In other news, I just released v0.4.0. It's a release of the status quo, so that existing users can anchor to this version while we have the freedom to make potentially breaking changes.)
Use saturation for range check.
This commit is wholly dead code. In the future, it would be useful anywhere the PSHUFB instruction is slow (#44) or not supported.