UTF-8 parsing with state machine

Hi, I'd like to propose a cleaner approach (IMHO) to parse UTF-8 in `core::str:from_utf8` and friends. Not confident to optimize the ASCII fast forward in unsafe, but I think a [state machine would describe the problem better](https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=974471087ef52f72660ebb1097877810) and is more readable.

As a bonus, the state machine could be made available for other more complex parsers. My case is for ANSI escape sequences which may not be valid UTF-8 but are embedded in UTF-8 streams. The machine exposes the number of `incomplete()` and `needed()` bytes, it can count the valid chars `seen()`. As such it would be very useful for general byte stream processing. It works with `no_std`, depending only on `core::str`.

The essence is quite simple:
```

impl Utf8Machine {
    pub fn turn(&mut self, input: &u8) -> Utf8Act {
        /*
         * legal utf-8 byte sequence
         * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94
         *
         *  Code Points        1st       2s       3s       4s
         * U+0000..U+007F     00..7F
         * U+0080..U+07FF     C2..DF   80..BF
         * U+0800..U+0FFF     E0       A0..BF   80..BF
         * U+1000..U+CFFF     E1..EC   80..BF   80..BF
         * U+D000..U+D7FF     ED       80..9F   80..BF
         * U+E000..U+FFFF     EE..EF   80..BF   80..BF
         * U+10000..U+3FFFF   F0       90..BF   80..BF   80..BF
         * U+40000..U+FFFFF   F1..F3   80..BF   80..BF   80..BF
         * U+100000..U+10FFFF F4       80..8F   80..BF   80..BF
         */
        use Utf8Act::*;
        use Utf8State::*;
        let seen = self.seen;
        let done = (self.seen + 1, Normal, Done);
        let (seen, state, act) = match (&self.state, input) {
            (Normal, 0x00...0x7F) => done,
            (Normal, 0x80...0xC1) => (seen, Normal, Invalid(1)),
            (Normal, 0xC2...0xDF) => (seen, C___x_, NeedMore(1)),
            (Normal, 0xE0...0xE0) => (seen, C__0__, NeedMore(2)),
            (Normal, 0xE1...0xEC) => (seen, C__x__, NeedMore(2)),
            (Normal, 0xED...0xED) => (seen, C__D__, NeedMore(2)),
            (Normal, 0xEE...0xEF) => (seen, C__x__, NeedMore(2)),
            (Normal, 0xF0...0xF0) => (seen, C_0___, NeedMore(3)),
            (Normal, 0xF1...0xF3) => (seen, C_x___, NeedMore(3)),
            (Normal, 0xF4...0xF4) => (seen, C_4___, NeedMore(3)),
            (Normal, 0xF5...0xFF) => (seen, Normal, Invalid(1)),

            (C___x_, 0x80...0xBF) => done,
            (C__xx_, 0x80...0xBF) => done,
            (C_xxx_, 0x80...0xBF) => done,

            (C__0__, 0xA0...0xBF) => (seen, C__xx_, NeedMore(1)),
            (C__D__, 0x80...0x9F) => (seen, C__xx_, NeedMore(1)),
            (C__x__, 0x80...0xBF) => (seen, C__xx_, NeedMore(1)),
            (C_0___, 0x90...0xBF) => (seen, C_xx__, NeedMore(2)),
            (C_4___, 0x80...0x8F) => (seen, C_xx__, NeedMore(2)),
            (C_x___, 0x80...0xBF) => (seen, C_xx__, NeedMore(2)),
            (C_xx__, 0x80...0xBF) => (seen, C_xxx_, NeedMore(1)),

            (C___x_, _) => (seen, Normal, Invalid(2)),
            (C__0__, _) => (seen, Normal, Invalid(2)),
            (C__D__, _) => (seen, Normal, Invalid(2)),
            (C__x__, _) => (seen, Normal, Invalid(2)),
            (C__xx_, _) => (seen, Normal, Invalid(3)),
            (C_0___, _) => (seen, Normal, Invalid(2)),
            (C_4___, _) => (seen, Normal, Invalid(2)),
            (C_x___, _) => (seen, Normal, Invalid(2)),
            (C_xx__, _) => (seen, Normal, Invalid(3)),
            (C_xxx_, _) => (seen, Normal, Invalid(4)),
        };
        self.seen = seen;
        self.state = state;
        act
    }
}
```

The playground includes applications of the state machine where one could implement the fast forward ASCII optimization.





Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

UTF-8 parsing with state machine #59399

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

UTF-8 parsing with state machine #59399

Description

Activity

nagisa commented on Mar 24, 2019

jocutajar commented on Mar 25, 2019

nagisa commented on Mar 26, 2019

jocutajar commented on Mar 29, 2019

Mark-Simulacrum commented on Mar 30, 2019

jerch commented on Apr 26, 2019

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions