Add an "ascii character" type to reduce `unsafe` needs in conversions

# Proposal

## Problem statement

For individual bytes, the backend can plausibly be expected to optimize things when it knows that a `char` came from an ASCII-ranged value.  However, for *compound* values, there's no realistic way that codegen backends can track enough information to optimize out UTF-8 validity checks.  That leads to lots of "look, it's ASCII" comments on `unsafe` blocks  because the safe equivalents have material performance impact.

We should offer a nice way that people can write such "the problem fundamentally only produces ASCII `String`s" code without needing `unsafe` and without needing spurious UTF-8 validation checks.

After all, [to quote `std::ascii`](https://doc.rust-lang.org/std/ascii/index.html),

> However, at times it makes more sense to only consider the ASCII character set for a specific operation.

## Motivation, use-cases

I was reminded about this by [this comment](https://github.com/rust-lang/rust/pull/105076/files#diff-cd71be06a2c36bdc291eee7db28d1ae47442105c61adfca4b5802e86a4fa56afR73-R76) in https://github.com/rust-lang/rust/pull/105076:
```rust
    pub fn as_str(&self) -> &str {
        // SAFETY: self.data[self.alive] is all ASCII characters.
        unsafe { crate::str::from_utf8_unchecked(self.as_bytes()) }
    }
```

But I've been thinking about this since this Reddit thread: <https://www.reddit.com/r/rust/comments/yaft60/zerocost_iterator_abstractionsnot_so_zerocost/>.  "base85" encoding is an examplar of problems where problem is fundamentally only producing ASCII.  But the code is doing a `String::from_utf8(outdata).unwrap()` at the end because other options aren't great.

One might say "oh, just build a `String` as you go" instead, but that doesn't work as well as you'd hope.  Pushing a byte onto a `Vec<u8>` generates *substantially* less complicated code than pushing one to a `String` (<https://rust.godbolt.org/z/xMYxj5WYr>) since a `0..=255` USV might still take 2 bytes in UTF-8.  That problem could be worked around with a [`BString`](https://docs.rs/bstring/latest/bstring/bstring/struct.BString.html) instead, going outside the standard library, but that's not a fix for the whole thing because then there's still a check needed later to get back to a `&str` or `String`.

There should be a `core` type for an individual ASCII character so that having proven to LLVM at the individual *character* level that things are in-range (which it can optimize well, and does in other similar existing cases today), the library can offer safe O(1) conversions taking advantage of that type-level information.

[Edit 2023-02-16] This conversation on zulip <https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/.E2.9C.94.20Iterate.20ASCII-only.20.26str/near/328343781> made me think about this too -- having a type gives clear ways to get "just the ascii characters from a string" using something like `.filter_map(AsciiChar::new)`.

[Edit 2023-04-27] Another conversation on zulip <https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/core.3A.3Astr.3A.3Afrom_ascii/near/353452589> about how on embedded the "is ascii" check is much simpler than the "is UTF-8" check, and being able to use that where appropriate can save a bunch of binary size on embedded. cc @kupiakos

## Solution sketches

In `core::ascii`,
```rust
/// One of the 128 Unicode characters from U+0000 through U+007F, often known as
/// the [ASCII](https://www.unicode.org/glossary/index.html#ASCII) subset.
///
/// AKA the characters codes from ANSI X3.4-1977, ISO 646-1973,
/// or [NIST FIPS 1-2](https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub1-2-1977.pdf).
///
/// # Layout
///
/// This type is guaranteed to have a size and alignment of 1 byte.
#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
#[repr(transparent)]
struct Char(u8 is 0..=127);

impl Debug for Char { … }
impl Display for Char { … }

impl Char {
    const fn new(c: char) -> Option<Self> { … }
    const fn from_u8(x: u8) -> Option<Self> { … }
    const unsafe fn from_u8_unchecked(x: u8) -> Self { … }
}

impl From<Char> for char { … }
impl From<&[Char]> for &str { … }
```

In `alloc::string`:
```rust
impl From<Vec<ascii::Char>> for String { … }
```
^ this `From` being the main idea of the whole thing

Safe code can `Char::new(…).unwrap()` since LLVM easily optimizes that for known values (<https://rust.godbolt.org/z/haabhb6aq>) or they can do it in `const`s, then use the non-reallocating infallible `From`s later if they need `String`s or `&str`s.

## Other possibilities

I wouldn't put any of these in an initial PR, but as related things

- This could be a 128-variant enum with `repr(u8)`.  That would allow `as` casting it, for better or worse.
- There could be associated constants (or variants) named `ACK` and `DEL` and such.
- Lots of `AsRef<str>`s are possible, like on `ascii::Char` itself or arrays/vectors thereof
  - And potentially `AsRef<[u8]>`s too
- Additional methods like `String::push_ascii`
- More implementations like `String: Extend<ascii::Char>` or `String: FromIterator<ascii::Char>`
- Checked conversions (using the well-known [ASCII fast paths](https://github.com/rust-lang/rust/blob/master/library/core/src/str/validations.rs#L210)) from `&str` (or `&[u8]`) back to `&[ascii::Char]`
- The base85 example would really like a `[u8; N] -> [ascii::Char; N]` operation it can use in a `const` so it can have something like ```const BYTE_TO_CHAR85: [ascii::Char; 85] = something(b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~").unwrap();```.  Long-term that's called `array::map` and unwrapping each value, but without const closures that doesn't work yet -- for now it could open-code it, though.

And of course there's the endless bikeshed on what to call the type in the first place.  Would it be worth making it something like `ascii::AsciiChar`, despite the stuttering name, to avoid `Char` vs `char` confusion?

## What happens now?

This issue is part of the libs-api team [API change proposal process]. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

[API change proposal process]: https://std-dev-guide.rust-lang.org/feature-lifecycle/api-change-proposals.html


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add an "ascii character" type to reduce `unsafe` needs in conversions #179

Proposal

Problem statement

Motivation, use-cases

Solution sketches

Other possibilities

What happens now?

11 remaining items

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Add an "ascii character" type to reduce unsafe needs in conversions #179

Description

Proposal

Problem statement

Motivation, use-cases

Solution sketches

Other possibilities

What happens now?

Activity

pitaj commented on Feb 12, 2023

scottmcm commented on Feb 12, 2023

safinaskar commented on Feb 20, 2023

scottmcm commented on Feb 20, 2023

programmerjake commented on Apr 27, 2023

safinaskar commented on Apr 28, 2023

kupiakos commented on Apr 28, 2023

BurntSushi commented on Apr 28, 2023

kupiakos commented on Apr 28, 2023

BurntSushi commented on Apr 28, 2023

scottmcm commented on May 1, 2023

11 remaining items

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions

Add an "ascii character" type to reduce `unsafe` needs in conversions #179