-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Optimize pack() #18524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Optimize pack() #18524
Conversation
maybe a interessting benchmark/test-case for
vs.
using the |
Unpack will always be slower than just strlen. However, your code revealed that repetitions were handled in a slow way where lots of temporary strings were created and then parsed. I opened a PR to fix that particular issue: #18803 |
could benchmark static inline void php_pack(const zval *val, size_t size,
php_pack_endianness enc, char *out)
{
zend_long z = zval_get_long(val);
if ((enc == PHP_LITTLE_ENDIAN) != MACHINE_LITTLE_ENDIAN) {
z = PHP_LONG_BSWAP(z);
}
memcpy(out, (char*)&z + sizeof(z) - size, size);
} might be faster |
Very strangely, my original code with zend_never_inline is slightly faster than master, but your code without zend_never_inline seems to beat that in my test with the 'J' specifier. Testing some more stuff... |
if the performance difference insignificant/marginal, as in hardly even benchmark-able, i would recommend just ignoring it. I like how this makes pack the code much simpler (assuming it actually works on BE) |
I think I managed to make the compiler happy and let it make good inlining decisions while keeping the code simple. For example for this: for ($i = 0; $i < 10_000_000; ++$i) {
pack("J", 0x7FFFFFFFFFFFFFFF);
} On an i7-4790:
And for this: for ($i=0;$i<4000000;$i++)
pack("nvc*", 0x1234, 0x5678, 65, 66); On the same machine:
Let's hope it's reproducible |
47a5320
to
3b1918b
Compare
It's consistent on my laptop. Squashed the commits and updated the descriptions. |
Instead of using lookup tables, we can use a combination of shifts and byte swapping to achieve the same thing in less cycles and with less code. Benchmark files --------------- pack1.php: ```php for ($i = 0; $i < 10_000_000; ++$i) { pack("J", 0x7FFFFFFFFFFFFFFF); } ``` pack2.php: ```php for ($i = 0; $i < 4000000; ++$i) { pack("nvc*", 0x1234, 0x5678, 65, 66); } ``` On an i7-4790: ``` Benchmark 1: ./sapi/cli/php pack1.php Time (mean ± σ): 408.8 ms ± 3.4 ms [User: 406.1 ms, System: 1.6 ms] Range (min … max): 403.6 ms … 413.6 ms 10 runs Benchmark 2: ./sapi/cli/php_old pack1.php Time (mean ± σ): 451.7 ms ± 7.7 ms [User: 448.5 ms, System: 2.0 ms] Range (min … max): 442.8 ms … 461.2 ms 10 runs Summary ./sapi/cli/php pack1.php ran 1.11 ± 0.02 times faster than ./sapi/cli/php_old pack1.php Benchmark 1: ./sapi/cli/php pack2.php Time (mean ± σ): 239.3 ms ± 6.0 ms [User: 236.2 ms, System: 2.3 ms] Range (min … max): 233.2 ms … 256.8 ms 12 runs Benchmark 2: ./sapi/cli/php_old pack2.php Time (mean ± σ): 271.9 ms ± 3.3 ms [User: 269.7 ms, System: 1.3 ms] Range (min … max): 267.4 ms … 279.0 ms 11 runs Summary ./sapi/cli/php pack2.php ran 1.14 ± 0.03 times faster than ./sapi/cli/php_old pack2.php ``` On an i7-1185G7: ``` Benchmark 1: ./sapi/cli/php pack1.php Time (mean ± σ): 263.7 ms ± 1.8 ms [User: 262.6 ms, System: 0.9 ms] Range (min … max): 261.5 ms … 268.2 ms 11 runs Benchmark 2: ./sapi/cli/php_old pack1.php Time (mean ± σ): 303.3 ms ± 6.5 ms [User: 300.7 ms, System: 2.3 ms] Range (min … max): 297.4 ms … 318.1 ms 10 runs Summary ./sapi/cli/php pack1.php ran 1.15 ± 0.03 times faster than ./sapi/cli/php_old pack1.php Benchmark 1: ./sapi/cli/php pack2.php Time (mean ± σ): 156.7 ms ± 2.9 ms [User: 154.7 ms, System: 1.7 ms] Range (min … max): 151.6 ms … 164.7 ms 19 runs Benchmark 2: ./sapi/cli/php_old pack2.php Time (mean ± σ): 174.6 ms ± 3.3 ms [User: 171.9 ms, System: 2.3 ms] Range (min … max): 170.7 ms … 180.4 ms 17 runs Summary ./sapi/cli/php pack2.php ran 1.11 ± 0.03 times faster than ./sapi/cli/php_old pack2.php ``` Co-authored-by: [email protected]
Rebased to solve conflict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified the logic and it looks good. Nice work with reducing all that stuff. Approving with an assumption that you managed to somehow test it on BE? If not, maybe @NattyNarwhal might be able to test it.
I quickly tested on the CI box, lot of pack stuff broken:
(ignoring failed tests that are already failing... I'll have to look at those monday) |
Probably something subtle with shifting around, I can take a look thursday |
If someone has a big endian server running ssh they'd be willing to share, that would be helpful. Likewise my keys at https://github.com/divinity76.keys |
You should be able to emulate it in qemu but haven't tried it myself so not sure if it's any good. |
I think Niels should already have access to that BE CI system, but if anyone needs access to try things, let me know. |
@NattyNarwhal can I get access? |
@divinity76 I’ve emailed you credentials. |
@nielsdos try diff --git a/ext/standard/pack.c b/ext/standard/pack.c
index 0bdc9fc7..3b840e4d 100644
--- a/ext/standard/pack.c
+++ b/ext/standard/pack.c
@@ -57,14 +57,26 @@ typedef ZEND_SET_ALIGNED(1, int unaligned_int);
/* {{{ php_pack */
static void php_pack(const zval *val, size_t size, php_pack_endianness endianness, char *output)
{
- zend_ulong zl = zval_get_long(val);
-
- if ((endianness == PHP_LITTLE_ENDIAN) != MACHINE_LITTLE_ENDIAN) {
- zl = PHP_LONG_BSWAP(zl);
- zl >>= (sizeof(zl) - size) * 8;
- }
-
- memcpy(output, (const char *) &zl, size);
+ zend_ulong zl = zval_get_long(val);
+#if MACHINE_LITTLE_ENDIAN
+ if (endianness != PHP_LITTLE_ENDIAN) {
+ /* packing big-endian on LE: swap full word first */
+ zl = PHP_LONG_BSWAP(zl);
+ }
+ /* on LE, LSBs are at &zl[0], so just copy */
+ memcpy(output, (const char *)&zl, size);
+#else
+ // big endian
+ if (endianness == PHP_BIG_ENDIAN) {
+ /* packing big-endian on BE: no swap, high bytes live at offset */
+ const char *src = (const char *)&zl + (sizeof(zl) - size);
+ memcpy(output, src, size);
+ } else {
+ /* packing little-endian on BE: swap then low bytes at &zl[0] */
+ zl = PHP_LONG_BSWAP(zl);
+ memcpy(output, (const char *)&zl, size);
+ }
+#endif
}
/* }}} */
that got all tests passing on @NattyNarwhal 's BE CI system 👍
(that |
as for the question "is it (still) faster?", divinity@php-ci-ppc64be ~/php-src $ cat bench.php
<?php
declare(strict_types=1);
for ($i=0;$i<4000000;$i++)
pack("nvc*", 0x1234, 0x5678, 65, 66);
divinity@php-ci-ppc64be ~/php-src $ hyperfine --warmup 3 --runs 20 \
'./sapi/cli/php ./bench.php' \
'./sapi/cli/php_old ./bench.php'
Benchmark 1: ./sapi/cli/php ./bench.php
Time (mean ± σ): 1.059 s ± 0.038 s [User: 1.053 s, System: 0.005 s]
Range (min … max): 1.033 s … 1.170 s 20 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: ./sapi/cli/php_old ./bench.php
Time (mean ± σ): 1.119 s ± 0.006 s [User: 1.114 s, System: 0.005 s]
Range (min … max): 1.111 s … 1.131 s 20 runs
Summary
./sapi/cli/php ./bench.php ran
1.06 ± 0.04 times faster than ./sapi/cli/php_old ./bench.php
divinity@php-ci-ppc64be ~/php-src $ cat /proc/cpuinfo
processor : 0
cpu : POWER8 (architected), altivec supported
clock : 3525.000000MHz
revision : 2.1 (pvr 004b 0201)
(...) |
dang.. I broke little-endian. Here is a version that has been tested successfully on both LE and BE: static void php_pack(const zval *val, size_t size, php_pack_endianness endianness, char *output)
{
zend_ulong zl = zval_get_long(val);
#if MACHINE_LITTLE_ENDIAN
if (endianness == PHP_BIG_ENDIAN) {
/* packing big-endian on LE: swap full word then copy from the high end */
zl = PHP_LONG_BSWAP(zl);
memcpy(output, (const char *)&zl + (sizeof(zl) - size), size);
} else {
/* packing little-endian on LE: no swap, just copy the low bytes */
memcpy(output, (const char *)&zl, size);
}
#else
if (endianness == PHP_LITTLE_ENDIAN) {
/* packing little-endian on BE: swap full word then copy from the low end */
zl = PHP_LONG_BSWAP(zl);
memcpy(output, (const char *)&zl, size);
} else {
/* packing big-endian on BE: no swap, just copy the high bytes */
memcpy(output, (const char *)&zl + (sizeof(zl) - size), size);
}
#endif
} here is BE-benchmarks: divinity@php-ci-ppc64be ~/php-src $ cat bench.php
<?php
declare(strict_types=1);
for ($i=0;$i<4000000;$i++)
pack("nvc*", 0x1234, 0x5678, 65, 66);
divinity@php-ci-ppc64be ~/php-src $ hyperfine --warmup 3 --runs 20 './sapi/cli/php ./bench.php' './sapi/cli/php_old ./bench.php'
Benchmark 1: ./sapi/cli/php ./bench.php
Time (mean ± σ): 1.031 s ± 0.004 s [User: 1.027 s, System: 0.005 s]
Range (min … max): 1.028 s … 1.041 s 20 runs
Benchmark 2: ./sapi/cli/php_old ./bench.php
Time (mean ± σ): 1.112 s ± 0.005 s [User: 1.107 s, System: 0.005 s]
Range (min … max): 1.107 s … 1.121 s 20 runs
Summary
./sapi/cli/php ./bench.php ran
1.08 ± 0.01 times faster than ./sapi/cli/php_old ./bench.php
divinity@php-ci-ppc64be ~/php-src $ cat /proc/cpuinfo | head
processor : 0
cpu : POWER8 (architected), altivec supported
clock : 3525.000000MHz
revision : 2.1 (pvr 004b 0201)
processor : 1
cpu : POWER8 (architected), altivec supported
clock : 3525.000000MHz
revision : 2.1 (pvr 004b 0201) here is LE benchmarks: hans@DESKTOP-EE15SLU:~/projects/php-src$ cat bench.php
<?php
declare(strict_types=1);
for ($i=0;$i<4000000;$i++)
pack("nvc*", 0x1234, 0x5678, 65, 66);
hans@DESKTOP-EE15SLU:~/projects/php-src$ hyperfine --warmup 3 --runs 20 './sapi/cli/php ./bench.php' './sapi/cli/php_old ./bench.php'
Benchmark 1: ./sapi/cli/php ./bench.php
Time (mean ± σ): 141.1 ms ± 1.1 ms [User: 135.4 ms, System: 1.4 ms]
Range (min … max): 140.0 ms … 144.2 ms 20 runs
Benchmark 2: ./sapi/cli/php_old ./bench.php
Time (mean ± σ): 147.9 ms ± 1.1 ms [User: 142.9 ms, System: 0.5 ms]
Range (min … max): 146.5 ms … 150.4 ms 20 runs
Summary
./sapi/cli/php ./bench.php ran
1.05 ± 0.01 times faster than ./sapi/cli/php_old ./bench.php
hans@DESKTOP-EE15SLU:~/projects/php-src$ cat /proc/cpuinfo | head
processor : 0
vendor_id : AuthenticAMD
cpu family : 25
model : 97
model name : AMD Ryzen 9 7950X 16-Core Processor
stepping : 2
microcode : 0xffffffff
cpu MHz : 4491.454
cache size : 1024 KB
physical id : 0 |
@divinity76 Thanks, that allowed me to figure out what was wrong. Tested on both LE & BE and it works now. |
Instead of using lookup tables, we can use a combination of shifts and
byte swapping to achieve the same thing in less cycles and with less
code.
Benchmark files
pack1.php:
pack2.php:
Results
On an i7-4790:
On an i7-1185G7: