Description
Proposal
Problem statement
For individual bytes, the backend can plausibly be expected to optimize things when it knows that a char came from an ASCII-ranged value. However, for compound values, there's no realistic way that codegen backends can track enough information to optimize out UTF-8 validity checks. That leads to lots of "look, it's ASCII" comments on unsafe blocks because the safe equivalents have material performance impact.
We should offer a nice way that people can write such "the problem fundamentally only produces ASCII Strings" code without needing unsafe and without needing spurious UTF-8 validation checks.
After all, to quote std::ascii,
However, at times it makes more sense to only consider the ASCII character set for a specific operation.
Motivation, use-cases
I was reminded about this by this comment in rust-lang/rust#105076:
pub fn as_str(&self) -> &str {
// SAFETY: self.data[self.alive] is all ASCII characters.
unsafe { crate::str::from_utf8_unchecked(self.as_bytes()) }
}
But I've been thinking about this since this Reddit thread: https://www.reddit.com/r/rust/comments/yaft60/zerocost_iterator_abstractionsnot_so_zerocost/. "base85" encoding is an exemplar of problems where the problem is fundamentally only producing ASCII. But the code is doing a String::from_utf8(outdata).unwrap() at the end because other options aren't great.
One might say "oh, just build a String as you go" instead, but that doesn't work as well as you'd hope. Pushing a byte onto a Vec<u8> generates substantially less complicated code than pushing one to a String (https://rust.godbolt.org/z/xMYxj5WYr) since a 0..=255 USV might still take 2 bytes in UTF-8. That problem could be worked around with a BString instead, going outside the standard library, but that's not a fix for the whole thing because then there's still a check needed later to get back to a &str or String.
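To make the trade-off concrete, here is a minimal sketch of the two options available on stable today. It uses a hypothetical hex encoder rather than base85 to stay short; the names `HEX`, `encode_bytes`, `encode_checked`, and `encode_unchecked` are invented for illustration and do not come from the linked code.

```rust
/// Lookup table for a hypothetical hex encoder. Every entry is ASCII,
/// but nothing in the type `u8` records that fact.
const HEX: &[u8; 16] = b"0123456789abcdef";

/// Builds the output as raw bytes, which is the cheap way to push.
fn encode_bytes(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len() * 2);
    for &b in input {
        out.push(HEX[(b >> 4) as usize]);
        out.push(HEX[(b & 0x0f) as usize]);
    }
    out
}

/// Option 1: safe, but re-validates the whole buffer as UTF-8
/// even though it can only ever contain ASCII.
pub fn encode_checked(input: &[u8]) -> String {
    String::from_utf8(encode_bytes(input)).unwrap()
}

/// Option 2: no validation, at the cost of `unsafe` and a
/// "look, it's ASCII" comment.
pub fn encode_unchecked(input: &[u8]) -> String {
    // SAFETY: every byte pushed by `encode_bytes` comes from `HEX`,
    // which contains only ASCII bytes.
    unsafe { String::from_utf8_unchecked(encode_bytes(input)) }
}
```

Both functions return the same string; the difference is only in whether the proof of ASCII-ness is re-derived at run time or asserted by the programmer.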
There should be a core type for an individual ASCII character so that having proven to LLVM at the individual character level that things are in-range (which it can optimize well, and does in other similar existing cases today), the library can offer safe O(1) conversions taking advantage of that type-level information.
[Edit 2023-02-16] This conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/.E2.9C.94.20Iterate.20ASCII-only.20.26str/near/328343781 made me think about this too -- having a type gives clear ways to get "just the ascii characters from a string" using something like .filter_map(AsciiChar::new).
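As a rough illustration of that pattern, the sketch below defines a stand-in `AsciiChar` newtype on stable Rust. The type, its `to_u8` accessor, and the `ascii_only` helper are assumptions made for this example, not the proposed standard library API.

```rust
/// Stand-in for the proposed type: a byte known to be in 0..=127.
/// (Hypothetical; the field is private so the invariant holds.)
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct AsciiChar(u8);

impl AsciiChar {
    /// Returns `Some` only for characters U+0000 through U+007F.
    pub const fn new(c: char) -> Option<Self> {
        if c as u32 <= 0x7F {
            Some(AsciiChar(c as u8))
        } else {
            None
        }
    }

    /// Illustrative accessor for the underlying byte.
    pub const fn to_u8(self) -> u8 {
        self.0
    }
}

/// Keeps just the ASCII characters of `s`, dropping everything else.
pub fn ascii_only(s: &str) -> Vec<AsciiChar> {
    s.chars().filter_map(AsciiChar::new).collect()
}
```

Because the fallible constructor returns an `Option`, it slots directly into `filter_map` without a closure.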
[Edit 2023-04-27] Another conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/core.3A.3Astr.3A.3Afrom_ascii/near/353452589 about how on embedded the "is ascii" check is much simpler than the "is UTF-8" check, and being able to use that where appropriate can save a bunch of binary size on embedded. cc @kupiakos
Solution sketches
In core::ascii,
/// One of the 128 Unicode characters from U+0000 through U+007F, often known as
/// the [ASCII](https://www.unicode.org/glossary/index.html#ASCII) subset.
///
/// AKA the characters codes from ANSI X3.4-1977, ISO 646-1973,
/// or [NIST FIPS 1-2](https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub1-2-1977.pdf).
///
/// # Layout
///
/// This type is guaranteed to have a size and alignment of 1 byte.
#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
#[repr(transparent)]
struct Char(u8 is 0..=127);
impl Debug for Char { … }
impl Display for Char { … }
impl Char {
const fn new(c: char) -> Option<Self> { … }
const fn from_u8(x: u8) -> Option<Self> { … }
const unsafe fn from_u8_unchecked(x: u8) -> Self { … }
}
impl From<Char> for char { … }
impl From<&[Char]> for &str { … }
In alloc::string:
impl From<Vec<ascii::Char>> for String { … }
^ this From being the main idea of the whole thing
Safe code can Char::new(…).unwrap() since LLVM easily optimizes that for known values (https://rust.godbolt.org/z/haabhb6aq) or they can do it in consts, then use the non-reallocating infallible Froms later if they need Strings or &strs.
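The sketch below shows one way the non-reallocating conversions could work, written against stable Rust. It is an assumption-laden illustration, not the actual implementation: `AsciiChar`, `as_str`, and `into_string` are hypothetical names, and free functions stand in for the proposed From impls because an outside crate cannot implement From<Vec<_>> for String.

```rust
use std::mem::ManuallyDrop;

/// Stand-in for the proposed type (hypothetical). `repr(transparent)`
/// guarantees the same size and alignment as `u8`.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
#[repr(transparent)]
pub struct AsciiChar(u8);

impl AsciiChar {
    pub const fn from_u8(x: u8) -> Option<Self> {
        if x <= 0x7F {
            Some(AsciiChar(x))
        } else {
            None
        }
    }
}

/// O(1) borrow as `&str`; no UTF-8 validation pass is needed.
pub fn as_str(chars: &[AsciiChar]) -> &str {
    // SAFETY: `AsciiChar` is `repr(transparent)` over `u8`, so the slice
    // has the layout of `[u8]`, and every element is <= 0x7F, which is
    // always valid single-byte UTF-8.
    unsafe {
        let bytes = std::slice::from_raw_parts(chars.as_ptr().cast::<u8>(), chars.len());
        std::str::from_utf8_unchecked(bytes)
    }
}

/// O(1) conversion to `String` that reuses the existing allocation.
pub fn into_string(chars: Vec<AsciiChar>) -> String {
    // Prevent the Vec from freeing the buffer we are about to hand over.
    let mut chars = ManuallyDrop::new(chars);
    let (ptr, len, cap) = (chars.as_mut_ptr(), chars.len(), chars.capacity());
    // SAFETY: same layout argument as in `as_str`; the buffer was
    // allocated by a `Vec` whose element has size and alignment 1.
    unsafe { String::from_raw_parts(ptr.cast::<u8>(), len, cap) }
}
```

The `unsafe` still exists, but it lives in one place behind a safe interface, so callers that build a `Vec<AsciiChar>` never need to write it themselves.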
Other possibilities
I wouldn't put any of these in an initial PR, but as related things
- This could be a 128-variant enum with repr(u8). That would allow as casting it, for better or worse.
- There could be associated constants (or variants) named ACK and DEL and such.
- Lots of AsRef<str>s are possible, like on ascii::Char itself or arrays/vectors thereof
  - And potentially AsRef<[u8]>s too
- Additional methods like String::push_ascii
- More implementations like String: Extend<ascii::Char> or String: FromIterator<ascii::Char>
- Checked conversions (using the well-known ASCII fast paths) from &str (or &[u8]) back to &[ascii::Char]
- The base85 example would really like a [u8; N] -> [ascii::Char; N] operation it can use in a const so it can have something like const BYTE_TO_CHAR85: [ascii::Char; 85] = something(b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~").unwrap();. Long-term that's called array::map and unwrapping each value, but without const closures that doesn't work yet -- for now it could open-code it, though.
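For the last item, an open-coded version might look roughly like the following. This is a sketch under assumptions: `AsciiChar` is a stand-in newtype, `ascii_array` is a hypothetical name for the `something` placeholder above, and a `match` with `panic!` is used in place of `.unwrap()` so the constant does not depend on Option::unwrap being callable in const context.

```rust
/// Stand-in for the proposed type (hypothetical).
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct AsciiChar(u8);

impl AsciiChar {
    /// Illustrative accessor for the underlying byte.
    pub const fn to_u8(self) -> u8 {
        self.0
    }
}

/// Converts a byte array to an array of ASCII characters, or returns
/// `None` if any byte is outside 0..=127. Written with a `while` loop
/// because iterators and closures are not usable in `const fn`.
pub const fn ascii_array<const N: usize>(bytes: &[u8; N]) -> Option<[AsciiChar; N]> {
    let mut out = [AsciiChar(0); N];
    let mut i = 0;
    while i < N {
        if bytes[i] > 0x7F {
            return None;
        }
        out[i] = AsciiChar(bytes[i]);
        i += 1;
    }
    Some(out)
}

/// The base85 alphabet from the text, checked at compile time.
pub const BYTE_TO_CHAR85: [AsciiChar; 85] = match ascii_array(
    b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~",
) {
    Some(table) => table,
    None => panic!("alphabet contains a non-ASCII byte"),
};
```

A non-ASCII byte in the alphabet would then be a compile error rather than a run-time panic, and the array length is checked against the declared 85.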
And of course there's the endless bikeshed on what to call the type in the first place. Would it be worth making it something like ascii::AsciiChar, despite the stuttering name, to avoid Char vs char confusion?
What happens now?
This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.
Activity
pitaj commented on Feb 12, 2023
I like the idea. Could it be added as a private API and used as an implementation detail for now? That would be a perfect demonstration of the usefulness of this type.
One consideration: if we add an "ascii character" type, why not also an "ascii string" type? Or would an "ascii string" just be [ascii::Char] / Vec<ascii::Char>?
scottmcm commented on Feb 12, 2023
It doesn't fit great as a pub(crate) internal thing. To use it in both core and alloc, it needs to be pub and unstable+hidden, so I'd rather just make it an unstable thing than try to hide it. And core has lots of unsafe already, so removing a few makes much less of a difference compared to in external crates where it might be the only unsafe. I'm not sure I'd want to, say, churn all the stringizing code to use this if libs-api isn't a fan of the general idea.
There'd be no extra invariants for an AsciiStr or AsciiString, since there's nothing like the "but it's UTF-8" that str and String have over [u8] and Vec<u8>. Whether they should exist anyway to have a separate documentation page, so that Vec<T> isn't cluttered with a bunch of Vec<ascii::Char> methods, I'm poorly equipped to judge. (I didn't want to get into discussions like whether there should be make_ascii_lowercase-style things that work on sequences of ascii chars, which is part of why I focused on using it to build Strings, rather than on operating on them.)
safinaskar commented on Feb 20, 2023
It seems you build on pattern types ( rust-lang/rust#107606 )?
scottmcm commented on Feb 20, 2023
@safinaskar I wrote it that way as the simplest way to phrase it in the summary. The field is private, so there are many possible implementations -- including the basic works-on-stable approach of just making it (be or wrap) a big enum -- and this could thus move forward immediately, without needing to be blocked on speculative features like pattern types.
programmerjake commented on Apr 27, 2023
one other thing to add: From<AsciiChar> for u8
safinaskar commented on Apr 28, 2023
You can wrap 128-variant enum in #[repr(transparent)] struct :)
kupiakos commented on Apr 28, 2023
Worth mentioning this is exactly what the ascii crate does
BurntSushi commented on Apr 28, 2023
I'll second this.
What I like about this is that it feels like a pretty small addition (a single type), but it provides a lot of additional expressive and safe power. I've run into these kinds of situations in various places, and it is in part (small part) one of the nice things about bstr. That is, the reason why an ascii::Char is useful is, IMO, mostly because of the UTF-8 requirement of String/&str. Without the UTF-8 requirement, a lot of the pain that motivates an ascii::Char goes away.
Also, while not part of the proposal, I saw a mention of separate AsciiString/AsciiStr types in the discussion above. I think these would probably be a bad idea. In particular, bstr 0.1 started with distinct BString/BStr types, and it turned out to be a major pain in practice. It's because you were constantly needing to convert to and from Vec<u8>/&[u8] types. It seems like a small thing, but it's a lot of friction.
One other thing that I think is worth bringing up is the Debug impl. Having a distinct ascii::Char does let us make the Debug impl for &[ascii::Char] nicer than what you'd get with &[u8]. But I don't think it can be as nice as a dedicated impl for &AsciiStr. (Although perhaps std can use specialization to fix that. Idk.)
kupiakos commented on Apr 28, 2023
Other reproductions of an ASCII enum: icu4x, the Unicode library, has a tinystr::AsciiByte enum. There's even a meme that's the top Google result for rustc_layout_scalar_valid_range_end.
Personally, I lean towards using an enum like that over this design:
This is a less flexible design. The names or properties of ASCII characters will never change and there aren't many of them, and so we might as well expose them as variants that can participate in pattern matching.
It looks like that's still bstr's model and hasn't changed. Though BStr: Deref<Target=[u8]> and BString: Deref<Target=Vec<u8>> do mitigate issues. EDIT: it still does provide this model, but it suggests you use extension traits with [u8] and Vec<u8> instead of BStr/BString in APIs as the conversions are free. Worth noting that @BurntSushi is the author of bstr.
BurntSushi commented on Apr 28, 2023
It's not. bstr works by defining extension traits, ByteSliceExt and ByteVecExt, which provide additional methods on [u8] and Vec<u8>, respectively. It still defines BStr and BString types for various reasons that generally revolve around "it's convenient to have a dedicated string type." For example, for Debug and other trait impls (such as Serde).
bstr 0.1 didn't have extension traits at all. It just had BStr and BString and a bunch of inherent methods.
scottmcm commented on May 1, 2023
PR open: rust-lang/rust#111009
Tracking issue: rust-lang/rust#110998