Description
The following program causes UB:
use std::os::windows::ffi::{OsStrExt, OsStringExt};
use std::ffi::{OsStr, OsString};

fn main() {
    let base = "a\té \u{7f}💩\r";
    let mut base: Vec<u16> = OsStr::new(base).encode_wide().collect();
    base.push(0xD800);
    let _res = OsString::from_wide(&base);
}
Miri says:
error: Undefined Behavior: type validation failed: encountered 0x0000d800, but expected a valid unicode codepoint
   --> /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libcore/char/convert.rs:102:5
    |
102 |     transmute(i)
    |     ^^^^^^^^^^^^ type validation failed: encountered 0x0000d800, but expected a valid unicode codepoint
    |
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
    = note: inside `std::char::from_u32_unchecked` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libcore/char/convert.rs:102:5
    = note: inside `std::sys_common::wtf8::Wtf8Buf::push_code_point_unchecked` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libstd/sys_common/wtf8.rs:204:26
    = note: inside `std::sys_common::wtf8::Wtf8Buf::from_wide` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libstd/sys_common/wtf8.rs:194:21
    = note: inside `<std::ffi::OsString as std::os::windows::ffi::OsStringExt>::from_wide` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libstd/sys/windows/ext/ffi.rs:101:44
note: inside `main` at wtf8.rs:8:16
   --> wtf8.rs:8:16
    |
8   |     let _res = OsString::from_wide(&base);
    |                ^^^^^^^^^^^^^^^^^^^^^^^^^^
The problem is this code:
rust/src/libstd/sys_common/wtf8.rs
Lines 293 to 305 in 96dd469
This calls push_code_point_unchecked unless the new code point is in 0xDC00..=0xDFFF, but what about surrogates in 0xD800..0xDC00?
This code is unchanged since its introduction in c5369eb. I am not sure what the intended safety contract of push_code_point_unchecked is. That method is not marked unsafe but clearly should be -- it calls char::from_u32_unchecked. So my guess is the safety precondition is that CodePoint must not be part of a surrogate pair, but the thing is, push calls it without actually ensuring that condition. The condition it does ensure is that the codepoint is not in 0xDC00..=0xDFFF, but that does not help.
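To make the gap concrete, here is a minimal sketch of the two ranges involved. The helper names are hypothetical and are not the ones used in wtf8.rs. The sketch only illustrates that a check for 0xDC00..=0xDFFF lets a lead surrogate such as 0xD800 through, even though the safe char::from_u32 rejects it.

```rust
/// Hypothetical helper: true for trail (low) surrogates, the only range
/// excluded by the check described above.
fn is_trail_surrogate(cp: u32) -> bool {
    (0xDC00..=0xDFFF).contains(&cp)
}

/// Hypothetical helper: true for any surrogate, lead or trail.
fn is_surrogate(cp: u32) -> bool {
    (0xD800..=0xDFFF).contains(&cp)
}

/// Hypothetical helper: true if `cp` passes the narrower check but still
/// cannot be represented as a `char`.
fn slips_through(cp: u32) -> bool {
    !is_trail_surrogate(cp) && char::from_u32(cp).is_none()
}
```

With these definitions, slips_through(0xD800) is true, which matches the value pushed in the reproducer.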
Activity
SimonSapin commented on May 30, 2020
0xD800..0xDC00 is excluded from Unicode scalar values a.k.a. char, but it’s fine as a Unicode code point.
I think this bug is not in the contract of push_code_point_unchecked (which is private anyway) but in its implementation, namely going through char. The fix is to either duplicate the logic of char::encode_utf8 into wtf8.rs, or (to avoid duplication) move that logic to a new function in libcore that takes a u32 parameter and that is called by both Wtf8Buf::push_code_point_unchecked and char::encode_utf8. This function would need to be public because wtf8.rs is in a different crate, but it should be perma-unstable and #[doc(hidden)].
SimonSapin commented on May 30, 2020
I would classify the priority of this bug as low.
Although this call to char::from_u32_unchecked is UB, I expect Miri checking for it explicitly is the only case where that has any consequence in today’s implementation:
char::encode_utf8 behaves as expected (for the purpose of WTF-8) in the surrogate range and does not exploit this UB.
char in rustc excludes values beyond char::MAX but not surrogates:
rust/src/librustc_middle/ty/layout.rs
Lines 505 to 508 in 0e9e408
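For illustration, the u32-based encoder suggested in the previous comment could look roughly like the sketch below. The name and signature are assumptions chosen for this example and are not taken from libcore. The function encodes a code point as generalized UTF-8 without ever constructing a char, so a surrogate is handled like any other three-byte code point.

```rust
/// Hypothetical encoder: writes `code` as (generalized) UTF-8 into `dst`
/// and returns the number of bytes written.
/// Assumes `code <= 0x10FFFF`; surrogates are not rejected, which is what
/// WTF-8 needs.
fn encode_utf8_raw(code: u32, dst: &mut [u8; 4]) -> usize {
    if code < 0x80 {
        dst[0] = code as u8;
        1
    } else if code < 0x800 {
        dst[0] = 0xC0 | (code >> 6) as u8;
        dst[1] = 0x80 | (code & 0x3F) as u8;
        2
    } else if code < 0x1_0000 {
        dst[0] = 0xE0 | (code >> 12) as u8;
        dst[1] = 0x80 | ((code >> 6) & 0x3F) as u8;
        dst[2] = 0x80 | (code & 0x3F) as u8;
        3
    } else {
        dst[0] = 0xF0 | (code >> 18) as u8;
        dst[1] = 0x80 | ((code >> 12) & 0x3F) as u8;
        dst[2] = 0x80 | ((code >> 6) & 0x3F) as u8;
        dst[3] = 0x80 | (code & 0x3F) as u8;
        4
    }
}
```

For scalar values the output agrees with char::encode_utf8, and for 0xD800 it yields the three bytes ED A0 80.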
SimonSapin commented on May 30, 2020
By the way, the docs at https://doc.rust-lang.org/std/char/fn.from_u32_unchecked.html#safety feel insufficient:
Which values are invalid?
Similarly, https://doc.rust-lang.org/std/char/index.html and https://doc.rust-lang.org/std/primitive.char.html say that char represents Unicode Scalar values, but I couldn’t find it documented anywhere that constructing an out of range char is Undefined Behavior rather than "merely" a logic bug.
Is this specified elsewhere?
What is the closest we have to a written down normative resource that specifies what is or isn’t UB in the Rust language? (For APIs I assume this is the responsibility of their respective doc-comments.)
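As a point of reference for the "which values" question, the safe constructor char::from_u32 accepts exactly the Unicode scalar values. The small sketch below spells out that range; the helper name is hypothetical and only restates the definition of a scalar value.

```rust
/// Hypothetical helper restating the definition of a Unicode scalar value:
/// any code point up to 0x10FFFF except the surrogates 0xD800..=0xDFFF.
fn is_scalar_value(v: u32) -> bool {
    v <= 0x10FFFF && !(0xD800..=0xDFFF).contains(&v)
}

/// Compares the helper against the standard library's safe constructor.
fn agrees_with_from_u32(v: u32) -> bool {
    is_scalar_value(v) == char::from_u32(v).is_some()
}
```

Any u32 for which is_scalar_value returns false is one that must not be passed to char::from_u32_unchecked.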
RalfJung commented on May 30, 2020
I agree the impact is low, but it's not just Miri -- this blocks #72683, which is how I discovered the problem.
That would be https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html
RalfJung commented on May 30, 2020
But then what is "unchecked" about it? Elsewhere in that file the code is careful not to construct a char from a CodePoint without checking for surrogates:
rust/src/libstd/sys_common/wtf8.rs
Lines 92 to 97 in 0e9e408
That made me assume the author was aware that surrogates in char are UB.
SimonSapin commented on May 30, 2020
The Wtf8 type contains a (generalized-UTF-8-encoded) sequence of code points (potentially including surrogates) that don’t form a surrogate pair.
push_code_point_unchecked doesn’t check for surrogate pairs.
CodePoint::to_char is a public method that is documented as returning a scalar value, therefore excluding all surrogates.
Ok I looked into it, there are three different people involved.
The original code in 2014 was YOLO transmute: https://github.com/SimonSapin/rust-wtf8/blob/76e023dfd56eef27ce36108a0182c156ededde2e/src/lib.rs#L164-L171
Later in 2014, SimonSapin/rust-wtf8@8a42f9e moved to duplicating logic instead.
In 2015, PR #21488, which first imported WTF-8 support in libstd, deduplicated that logic by adding a core::char::encode_utf8_raw function that takes u32. (I knew that approach sounded familiar…)
In 2016, PR #32204 changed the signature of char::encode_utf8 and in passing removed encode_utf8_raw. It made wtf8.rs use char::from_u32_unchecked + char::encode_utf8 instead, presumably in the middle of updating many other callers.
RalfJung commented on May 30, 2020
Okay, so the fix would be to revert the part of #32204 that removed encode_utf8_raw?
SimonSapin commented on May 30, 2020
That or copy encode_utf8_raw into wtf8.rs. I find them both not great, so meh.
RalfJung commented on May 30, 2020
Even with that done, there's still UB in another testcase:
This is converting something to a char before calling encode_utf16 -- so probably the same issue as the encode_utf8 above, just for a different encoding.