Closed

Description
I noticed this `is_ascii` check:
rust/src/libcore/char/methods.rs
Line 1078 in 6187684
Then someone told me it's there because of the `as u8`. If you pass a non-ASCII char, the `as u8` would corrupt that char. That's why the `is_ascii` check is there before it. But then I wondered: why isn't this just cast to a bigger data type that includes all Unicode code points? This method (and others) are slower than they have to be.
Also see this discussion in Discord with more valuable information: https://discordapp.com/channels/442252698964721669/443150878111694848/629982972052897792.
Activity
SimonSapin commented on Oct 5, 2019
(I can’t seem to find the relevant part in the middle of this discord stream…)
I’m not sure what you mean by “corrupt”. `some_char as u8` is well defined: it takes the numerical value of the code point (same as `some_char as u32`), then truncates to keep the lower 8 bits. `'A' as u8` equals 0x41, but so does `'🁁' as u8`. We want to reject the latter since U+1F041 🁁 is not an ASCII upper-case letter. If `char::is_ascii_uppercase` is implemented in terms of `u8::is_ascii_uppercase`, the ASCII range check is necessary. (It could be a `u8` range check too, for example with `u8::try_from(u32)`.)
But maybe `char::is_ascii_uppercase` doesn’t need to be implemented in terms of `u8::is_ascii_uppercase`. Maybe it could duplicate the logic instead.
Once upon a time `u8::is_ascii_uppercase` was implemented with a lookup table of 256 entries. A flat lookup table for all of Unicode would be way too big. But these days it’s based on matching a `u8` range pattern. Doing the same with a `char` range pattern could also work.
That said, would it actually be faster? Maybe the optimizer already does its job well. If you want to work on this, consider adding some benchmark to show that code duplication is worth it.
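To make the truncation point concrete, here is a minimal sketch of the variants mentioned in this comment. The function names (`upper_via_u8`, `upper_unchecked`, `upper_via_try_from`, `upper_via_char_range`) are hypothetical and chosen for illustration only; they are not the standard library's actual source, just approximations of the shapes being discussed.

```rust
use std::convert::TryFrom;

/// Guard with an ASCII check, then delegate to the `u8` method.
fn upper_via_u8(c: char) -> bool {
    c.is_ascii() && (c as u8).is_ascii_uppercase()
}

/// Same delegation without the guard. This is wrong for non-ASCII input,
/// because `as u8` keeps only the low 8 bits of the code point.
fn upper_unchecked(c: char) -> bool {
    (c as u8).is_ascii_uppercase()
}

/// The guard expressed as a checked conversion instead of `is_ascii()`.
fn upper_via_try_from(c: char) -> bool {
    u8::try_from(c as u32).map_or(false, |b| b.is_ascii_uppercase())
}

/// Duplicated logic: match on a `char` range pattern, no cast at all.
fn upper_via_char_range(c: char) -> bool {
    match c {
        'A'..='Z' => true,
        _ => false,
    }
}

fn main() {
    let domino = '\u{1F041}'; // U+1F041, the character from the comment above

    // Both casts yield 0x41 because the upper bits are discarded.
    assert_eq!('A' as u8, 0x41);
    assert_eq!(domino as u8, 0x41);

    // Without the guard, the domino tile is misclassified as upper-case.
    assert!(upper_unchecked(domino));
    assert!(!upper_via_u8(domino));

    // The three guarded or cast-free variants agree on every scalar value.
    for cp in 0u32..=0x10FFFF {
        if let Some(c) = char::from_u32(cp) {
            let expected = upper_via_u8(c);
            assert_eq!(upper_via_try_from(c), expected);
            assert_eq!(upper_via_char_range(c), expected);
        }
    }
    println!("all variants agree");
}
```

The sketch only shows that the variants are equivalent in behavior; whether any of them compiles to faster code is a separate question that needs measuring.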
ghost commented on Oct 5, 2019
Oh it seems the Discord link is a bit buggy. Try searching for `Why is there this self.is_ascii() check?` in the search bar on the Discord server. Then you should get to my message with the discussion below it. One guy also posted this comparison of the instructions generated: https://godbolt.org/z/RBADFT.
BurntSushi commented on Oct 5, 2019
I think the minimal standard here should be a benchmark before we change the implementation. Can you provide that?
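As a rough idea of what such a benchmark could look like, here is a minimal timing sketch that uses only the standard library. Everything in it is an assumption made for illustration: the function names, the sample input, and the round count are hypothetical, and a proper measurement would more likely use a dedicated benchmark harness. It needs a toolchain where `std::hint::black_box` is stable, and should be built with optimizations enabled.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the current shape: ASCII guard, then `u8` method.
fn with_ascii_check(c: char) -> bool {
    c.is_ascii() && (c as u8).is_ascii_uppercase()
}

/// Hypothetical alternative: match directly on a `char` range pattern.
fn with_char_range(c: char) -> bool {
    match c {
        'A'..='Z' => true,
        _ => false,
    }
}

/// Count how many characters satisfy the predicate.
fn count_matches(input: &[char], f: impl Fn(char) -> bool) -> usize {
    input.iter().filter(|&&c| f(black_box(c))).count()
}

/// Run the predicate over the input `rounds` times and report elapsed time.
fn time_it(input: &[char], f: impl Fn(char) -> bool, rounds: u32) -> (usize, Duration) {
    let start = Instant::now();
    let mut total = 0;
    for _ in 0..rounds {
        total += count_matches(black_box(input), &f);
    }
    (black_box(total), start.elapsed())
}

/// Mixed ASCII and non-ASCII sample text, repeated to a fixed length.
fn sample_input() -> Vec<char> {
    "Hello, World! ÄÖÜ 🁁 rust-lang/rust ABCxyz 0123"
        .chars()
        .cycle()
        .take(64 * 1024)
        .collect()
}

fn main() {
    let input = sample_input();
    let (a, time_a) = time_it(&input, with_ascii_check, 200);
    let (b, time_b) = time_it(&input, with_char_range, 200);
    // Both predicates must agree before the timings mean anything.
    assert_eq!(a, b);
    println!("ascii check + u8 method: {:?}", time_a);
    println!("char range match:        {:?}", time_b);
}
```

Timings from a loop like this are noisy and depend on the input mix, so they are at best a first indication and not a substitute for the benchmark requested here.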
ghost commented on Oct 5, 2019
I'm not that experienced with Rust yet, sorry. I just wanted to point out this possible performance regression.
Shouldn't the difference in instruction count shown on Godbolt suffice to know that it's slower?
vs.
anirudhb commented on Oct 6, 2019
Looking at the assembly, it seems that the problem is the `is_ascii` check.
These three functions produce identical assembly:
So, it seems that the `is_ascii` check is unnecessary.
I could make a PR for this.
anirudhb commented on Oct 6, 2019
Here are some benchmarks:
So it seems that removing the `is_ascii` check has a measurable performance difference.
SimonSapin commented on Oct 7, 2019
`custom_ascii_uppercase_with_ascii_check_custom_is_ascii_convert_to_u8` is incorrect. The conversion should happen not for the ASCII check but for the `match`, and the pattern should use byte literals like `b'A'`. (Though I expect this won’t affect results much.)
`char::is_ascii_*` codegen #67585
Rollup merge of rust-lang#67585 - ranma42:fix/char-is-ascii-codegen, …
`char::is_ascii_digit()` is slower than `char::is_digit(10)` #68453