
Add SSE2 dec_loop.c #46

Closed · wants to merge 1 commit

Conversation

@aqrit commented Nov 25, 2017

Use saturation for range check.

This commit is wholly dead code. In the future, it would be useful anywhere the PSHUFB instruction is slow (#44) or not supported.
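For readers who haven't opened the diff, the saturation trick works roughly like this (a standalone sketch of the technique, not this PR's actual code): with unsigned saturating subtraction, (x -sat hi) is nonzero iff x > hi, and (lo -sat x) is nonzero iff x < lo, so a byte-range check needs only SSE2 and no PSHUFB.

#include <emmintrin.h>
#include <stdio.h>
#include <string.h>

// Return nonzero if all 16 bytes of x lie in [lo, hi].
static int all_in_range(__m128i x, unsigned char lo, unsigned char hi)
{
    __m128i above = _mm_subs_epu8(x, _mm_set1_epi8((char) hi)); // nonzero where x > hi
    __m128i below = _mm_subs_epu8(_mm_set1_epi8((char) lo), x); // nonzero where x < lo
    __m128i bad   = _mm_or_si128(above, below);                 // all-zero iff all bytes in range
    return _mm_movemask_epi8(_mm_cmpeq_epi8(bad, _mm_setzero_si128())) == 0xFFFF;
}

int main(void)
{
    char buf[16];
    memcpy(buf, "ABCDEFGHIJKLMNOP", 16);
    __m128i v = _mm_loadu_si128((const __m128i *) buf);
    printf("%d\n", all_in_range(v, 'A', 'Z')); // prints 1
    return 0;
}

The real decoder has to check several ranges at once, but this is the primitive it builds on.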

@aqrit (Author) commented Nov 25, 2017

I'm unsure about the best way to hook it up.

I would advocate for a minimalist CPU detection library (#13), maybe the one from the x265 project?

@aklomp (Owner) commented Nov 12, 2019

What's interesting is that SSE2 is a mandatory part of the AMD64 architecture, so we could use this codec as a fallback instead of the plain decoder when we detect at compile time that we're compiling for AMD64.

As for CPU detection, I want to tackle this in the future, but I was thinking about using GCC's __builtin_cpu family of functions instead of maintaining something myself.
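A minimal sketch of how the two ideas could combine (the macro and function names are illustrative, not this library's actual API; the builtins are GCC/Clang-specific): SSE2 can be assumed unconditionally when targeting AMD64 at compile time, while finer-grained choices use the builtins at runtime.

#include <stdio.h>

// SSE2 is part of the AMD64 baseline, so an SSE2 fallback can be
// selected at compile time whenever that architecture is targeted.
#if defined(__x86_64__) || defined(_M_X64)
#define HAVE_SSE2_BASELINE 1
#else
#define HAVE_SSE2_BASELINE 0
#endif

static const char *choose_codec(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("ssse3"))
        return "ssse3";
    if (HAVE_SSE2_BASELINE || __builtin_cpu_supports("sse2"))
        return "sse2";
    return "plain";
}

int main(void)
{
    printf("codec: %s\n", choose_codec());
    return 0;
}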

On the diff in dec_loop.c:

}

// Pack 16 bytes into the 12-byte output format and store back:
_mm_storeu_si128((__m128i *)o, dec_reshuffle(out));
@aklomp (Owner):
The dec_reshuffle function appears to be undefined.

@aqrit (Author):

This was meant for Atom CPUs, so the SSSE3 dec_reshuffle() would have been used.

I tried to do an SSE2 packer here, but it is not too great.
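For context, the SSSE3 reshuffle step being discussed looks roughly like the version from Wojciech Muła's base64 writeups (a sketch, not necessarily this repository's exact code): two multiply-add steps merge the four 6-bit fields of each 32-bit lane, and a final PSHUFB packs the three significant bytes of each lane together.

#include <tmmintrin.h>

static __m128i dec_reshuffle(__m128i in)
{
    // Within each 16-bit lane: ab = a * 64 + b (merge pairs of 6-bit fields).
    const __m128i merged = _mm_maddubs_epi16(in, _mm_set1_epi32(0x01400140));
    // Within each 32-bit lane: abcd = ab * 4096 + cd (merge the two halves).
    const __m128i packed = _mm_madd_epi16(merged, _mm_set1_epi32(0x00011000));
    // Gather the 3 significant bytes of each lane into the low 12 bytes.
    return _mm_shuffle_epi8(packed, _mm_setr_epi8(
        2, 1, 0, 6, 5, 4, 10, 9, 8, 14, 13, 12, -1, -1, -1, -1));
}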

@aklomp (Owner):

I tried to get the code working with the SSSE3 decoder to get an idea of its performance, but hit a small bug: the function call should be dec_reshuffle(m0).

If the assumption is that this code would be used in conjunction with the SSSE3 reshuffler, then why not use the full SSSE3 codec?

One possible answer comes when testing this on a 32-bit Atom N270. There is a large improvement over the stock SSSE3 decoder: 591 MB/s for the SSE2 code against 427 MB/s for the SSSE3 code.

It's significantly slower on my desktop i5-4590S though: 6150 MB/s against 6800 MB/s for the stock SSSE3 codec. This makes it hard to justify merging this code for common use.

Maybe GCC's __builtin_cpu_is("atom") builtin would be useful here...
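A hedged sketch of that idea: GCC does accept "atom" (the Bonnell microarchitecture) as an argument to __builtin_cpu_is, so the dispatch could look like this (the codec choice strings are illustrative).

#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();
    // "atom" matches the Bonnell microarchitecture in GCC's CPU names.
    if (__builtin_cpu_is("atom"))
        puts("prefer the SSE2 loop: PSHUFB is slow here");
    else
        puts("prefer the SSSE3 loop");
    return 0;
}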

@htot (Contributor) commented Nov 15, 2019

> As for CPU detection, I want to tackle this in the future, but I was thinking about using GCC's __builtin_cpu family of functions instead of maintaining something myself.

We might use CPU detection to tweak the algorithm for performance as well; see the benchmarks.

@aqrit (Author) commented Nov 15, 2019

I wonder if the new packing code is slower on old (read: Atom) hardware as well. It trades logic/shift operations for multiplies.

Does someone want to try it with the old dec_reshuffle()?

Also, Atom should probably be using aligned loads.

Edit: this is being built 64-bit, right?

I assume people have seen the new competition?
https://github.com/powturbo/TurboBase64/blob/master/index.md
https://github.com/WojciechMula/base64-avx512

@aklomp (Owner) commented Nov 15, 2019

I did see the new breakthroughs by @WojciechMula and @lemire; they're what piqued my interest in working on this library again. Their AVX512 codec is impressive work. I'd like to add it to this library some day, but the problem is that it currently requires exotic hardware to test and verify the code. So it's something for the longer term.

I did see @powturbo's code. While their plain codec is indeed faster than our plain codec, their other claims don't really hold water. I did the benchmarks myself and SIMD still wins in the end. Notice that our NEON codecs are mysteriously missing from their benchmarks, and the AVX2 codecs only get a mention but no numbers.

@aqrit (Author) commented Nov 15, 2019

This patch needs tweaking for efficient operation with only 8 XMM registers (i.e. in 32-bit builds).

@htot (Contributor) commented Nov 16, 2019

@powturbo's benchmarks say our decoding is slower than our encoding, the opposite of our own benchmarks. Somewhere something must be wrong.

@aklomp (Owner) commented Nov 16, 2019

I pushed a quick-and-dirty branch called turbobase64 which integrates the TurboBase64 codec into our benchmarks. You can build it yourself or see some results in the CI here. I didn't hook their codec up to our test suite yet.

The result: their plain codec is faster than ours (assuming it also does all the necessary validity checks), but it is no match for the SIMD codecs by a long shot. Don't believe everything you read on the Internet.

Agreed about the decoding anomaly: our decoder should be faster than the encoder. I'd like to see a reproducible process for generating those benchmarks.

@powturbo commented Nov 17, 2019

Well, TurboBase64 is a small experiment that has existed since 2016. It's strange that it hasn't gotten any attention since then. The fastbase64 people also tested my library and didn't complain about the results. Nobody else has included TurboBase64 in benchmarks since then.

I primarily ran the benchmarks against fastbase64, which included other Base64 functions from Chromium, Apple QuickTime, Linux, ... In the end, I included the SSE benchmark from your library. Recently, I adapted TurboBase64 to ARM and ran some benchmarks against the NEON implementation in fastbase64. I have not redone all the other benchmarks (same CPU). As your library seemed unmaintained for a long time, I'm not aware of any improvements made since. At the time of benchmarking, the results were 100% correct: scalar was effectively faster than the other scalar libraries, and decoding was even faster than SSE with the hardware and compilers of that time. You are invited to verify the benchmark under the same conditions.

The goal is not to offer a library that is faster than everything, but to show that scalar can also be improved enough to be competitive with, or possibly faster than, other SIMD implementations. I could simply include the famous SIMD work of Wojciech Muła if I wanted to add other SSE and AVX2 functions to TurboBase64. Detecting the CPU in use and switching between functions is trivial work.

Edit: Your CI results show TurboBase64 decoding 2x faster than Base64's SIMD NEON codec.

@aklomp (Owner) commented Nov 17, 2019

Hi @powturbo, nice to see you've joined the discussion! You've done some impressive work on your library. Your plain codec is a speed monster, and I'm a bit envious of it actually. I just think you're overselling its speed a bit when compared to codecs that use a SIMD approach. It's like saying you have the fastest bicycle in a world where motorcycles exist. It's still a great achievement, and it has the potential to become a very portable, "fast everywhere" library.

I was unaware of your codec until your comments on HN some weeks ago. Then again, as you point out, I haven't been very active in the Base64 field since 2016. Many other hobbies to explore and lives to lead, etc. I've decided to start working on this library again because I left a bunch of loose ends a few years ago and I want to clean them up as best I can, so that I can hopefully "complete" this library and move on in good conscience.

I did notice that your decoder does not check the validity of the input. Not every string is a valid Base64 string. That's perhaps fine for the library's intended use cases, but it's also a shortcut that makes it easier to create a fast table-based algorithm. I was unable to hook up your decoder to our test suite because it does not output such an "input is bad" flag.

As for the issue of benchmarking, it's interesting that your benchmark concludes that our decoding is slower than our encoding. Normally it's the other way around, and when looking at the code it should be obvious why. I'd like to see why this is, and rerun the benchmarks, but I can't seem to find your benchmarking code.

You're right about the situation with Neon64. I noticed that myself when I added cross-platform CI builds a few days ago. (I don't own a Neon64-capable device that I can test on.) I'd like to know what hardware Travis CI uses for their Aarch64 machines. It's possible that it's not even real ARM hardware, but emulation, in which case the results will naturally be poor. Note that the Neon32 benchmarks are for sure run through Qemu (by our CI script), so they are not indicative of actual performance.

If genuine, the Neon64 numbers are worrying, because those codecs apparently have very good performance on the iPhone. Maybe the performance hit is due to poor code generation by the compiler, or something else like register pressure. I'm considering rewriting parts of the codec in inline assembler to see if it makes a difference. I'm also considering buying a cheap Aarch64 board so that I have something to test on. Until then, your library looks to be much faster on that platform.

@htot (Contributor) commented Nov 17, 2019

> I'd like to know what hardware Travis CI uses for their Aarch64 machines.

lxd-arm64 → the build ran within an LXD container on Arm64-based infrastructure (currently delivered by Packet), according to the docs.

@mayeut (Contributor) commented Nov 17, 2019

@aklomp, the poor NEON performance is probably due to the architectural differences between an iPhone (very good at lookups and vldX/vstX in NEON) and whatever travis-ci uses (I don't know whether they use ThunderX or Ampere chips, but I know that ThunderX is bad at those operations). libjpeg-turbo uses different implementations depending on the actual architecture.

@powturbo commented Nov 19, 2019

Hi @aklomp,

Thank you.

I'm only interpreting the benchmark numbers, and other SSE Base64 implementations were effectively not that fast at the time of benchmarking. Scalar was faster, but you may get other results depending on the hardware used, because the difference was not that big (see benchmarks). I've also benchmarked base64simd, and the results were similar.

It's not unusual for scalar to beat (unoptimized) SSE and even AVX2 (see e.g. https://github.com/powturbo/TurboHist).

As stated in the readme, the benchmarks are done with turbobench:
1. Clone turbobench.
2. Download base64, turbobase64 and fastbase64 into the turbobench directory.
3. Compile the base64 library with the provided base64 makefile (output: libbase64.o).
4. Type "make BASE64=1" in the turbobench directory.
5. Type "./turbobench -eturbob64/b64_ssse3/b64_sse41 file -I15 -J15".

When you use the base64 version that was benchmarked (from the beginning of 2017, I think), you should see that decoding is a lot slower than encoding for the base64 SSE codec. You can also verify this with your own benchmark tools.

Travis is using (Ampere eMAG) Packet ARM servers; see the review.

@powturbo commented
Hi,

I've now benchmarked base64 using the latest version, including AVX2. In my benchmarks, with the CPU used, SSE encoding is still faster than decoding.

@htot (Contributor) commented Nov 21, 2019

No doubt there is a difference in the measurement method. We already found that the results depend on buffer size, threads used, CPU type, and memory speed. I imagine that with your benchmark the effect of the cache is eliminated (large buffer, random data). This might be comparable to the Edison, which has no cache at all. And indeed, on the Edison, SSSE3 encoding is faster than decoding; see (my slightly outdated) benchmarks. If so, turning on multi-threading will speed up the plain codec, but not change SSSE3 and AVX2.

@mayeut (Contributor) commented Nov 21, 2019

> No doubt there is a difference in the measurement method.

Sure: the base64 benchmark uses the input size for both functions (decoded/raw byte count for encoding, encoded byte count for decoding), while the turbobase64 bench seems to always use the decoded/raw byte count.
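To see how much that convention matters (illustrative numbers, not measured ones): Base64 expands data by a factor of 4/3, so a decoder that consumes 4 GB of encoded input and produces 3 GB of raw output in one second is reported as 4 GB/s under the first convention but only 3 GB/s under the second. That constant 33% gap is enough to flip an encode-vs-decode comparison.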

@aklomp (Owner) commented Nov 21, 2019

@powturbo Thanks for doing the benchmarks. The difference between the encoder and decoder is more in line now with what I would expect. If you ran the tests on a fresh checkout, it may be because of some codec improvements.

I'm still impressed by the speed of your plain codec. I've been playing with my own 12-bit lookup table code over the past few days and found it impossible to beat yours, even when I pull out all the tricks like loop unrolling. I'll push that code soon because halving the number of table lookups makes a massive performance difference.
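For the curious, the 12-bit table idea sketched out (an illustration of the technique, not the code being pushed): a 4096-entry table maps every 12-bit value to the two Base64 characters that encode it, so each 24-bit input group costs two lookups instead of four.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

static const char b64chars[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

static uint16_t lut[4096];

static void lut_init(void)
{
    for (int i = 0; i < 4096; i++) {
        // First character (high 6 bits) in the low byte, second in the high byte.
        lut[i] = (uint16_t) ((uint8_t) b64chars[i >> 6]
               | (uint16_t) b64chars[i & 0x3F] << 8);
    }
}

// Encode whole 3-byte groups; tail and padding handling omitted.
static void encode_groups(const uint8_t *src, size_t groups, char *dst)
{
    for (size_t i = 0; i < groups; i++, src += 3, dst += 4) {
        uint32_t n = (uint32_t) src[0] << 16 | (uint32_t) src[1] << 8 | src[2];
        uint16_t hi = lut[n >> 12];   // one lookup yields two output chars
        uint16_t lo = lut[n & 0xFFF];
        dst[0] = (char) (hi & 0xFF);
        dst[1] = (char) (hi >> 8);
        dst[2] = (char) (lo & 0xFF);
        dst[3] = (char) (lo >> 8);
    }
}

int main(void)
{
    char out[9] = { 0 };
    lut_init();
    encode_groups((const uint8_t *) "Man is", 2, out);
    printf("%s\n", out); // TWFuIGlz
    return 0;
}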

The benchmarking in this library was always a bit of an afterthought. The benchmarking program was quickly thrown together without much regard for eliminating sources of bias, or much thought about what to measure to get meaningful and repeatable results. It should probably be on the longlist of things that need work. The variance in the results can be frustrating: sometimes it's hard to tell whether a change is really an improvement or a regression.

Codec performance varies enormously across processors, that's for sure. Code that is fast on one architecture is slow on another, as demonstrated most clearly by the NEON64 code. I have plans to add a new optional codec_choose backend which uses GCC's __builtin_cpu runtime detection to determine x86 architecture at runtime. That should make it possible to have fine-grained control over which codec runs where. (And this closes the circle with the topic of this PR :)

(In other news, I just released v0.4.0. It's a release of the status quo, so that existing users can anchor to this version while we have the freedom to make potentially breaking changes.)

@aqrit mentioned this pull request on Jun 23, 2022
@aqrit closed this on Jun 27, 2022