Skip to content

OsString::from_wide (in Windows OsStringExt) is unsound #72760

Closed
@RalfJung

Description

@RalfJung
Member

The following program causes UB:

use std::os::windows::ffi::{OsStrExt, OsStringExt};
use std::ffi::{OsStr, OsString};

fn main() {
    let base = "a\té \u{7f}💩\r";
    let mut base: Vec<u16> = OsStr::new(base).encode_wide().collect();
    base.push(0xD800);
    let _res = OsString::from_wide(&base);
}

Miri says:

error: Undefined Behavior: type validation failed: encountered 0x0000d800, but expected a valid unicode codepoint
   --> /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libcore/char/convert.rs:102:5
    |
102 |     transmute(i)
    |     ^^^^^^^^^^^^ type validation failed: encountered 0x0000d800, but expected a valid unicode codepoint
    |
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
            
    = note: inside `std::char::from_u32_unchecked` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libcore/char/convert.rs:102:5
    = note: inside `std::sys_common::wtf8::Wtf8Buf::push_code_point_unchecked` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libstd/sys_common/wtf8.rs:204:26
    = note: inside `std::sys_common::wtf8::Wtf8Buf::from_wide` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libstd/sys_common/wtf8.rs:194:21
    = note: inside `<std::ffi::OsString as std::os::windows::ffi::OsStringExt>::from_wide` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libstd/sys/windows/ext/ffi.rs:101:44
note: inside `main` at wtf8.rs:8:16
   --> wtf8.rs:8:16
    |
8   |     let _res = OsString::from_wide(&base);
    |                ^^^^^^^^^^^^^^^^^^^^^^^^^^

The problem is this code:

pub fn push(&mut self, code_point: CodePoint) {
if let trail @ 0xDC00..=0xDFFF = code_point.to_u32() {
if let Some(lead) = (&*self).final_lead_surrogate() {
let len_without_lead_surrogate = self.len() - 3;
self.bytes.truncate(len_without_lead_surrogate);
self.push_char(decode_surrogate_pair(lead, trail as u16));
return;
}
}
// No newly paired surrogates at the boundary.
self.push_code_point_unchecked(code_point)
}

This calls push_code_point_unchecked unless the new code point is in 0xDC00..=0xDFFF, but what about surrogates in 0xD800..0xDC00?

This code is unchanged since its introduction in c5369eb. I am not sure what the intended safety contract of push_code_point_unchecked is. That method is not marked unsafe but clearly should be -- it calls char::from_u32_unchecked. So my guess is the safety precondition is that CodePoint must not be part of a surrogate pair, but the thing is, push calls it without actually ensuring that condition. The condition it does ensure is that the codepoint is not in 0xDC00..=0xDFFF, but that does not help.

Activity

added
I-unsoundIssue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness
on May 29, 2020
added
C-bugCategory: This is a bug.
O-windowsOperating system: Windows
T-libsRelevant to the library team, which will review and decide on the PR/issue.
on May 29, 2020
added
I-prioritizeIssue: Indicates that prioritization has been requested for this issue.
on May 30, 2020
SimonSapin

SimonSapin commented on May 30, 2020

@SimonSapin
Contributor

0xD800..0xDC00 is excluded from Unicode scalar values a.k.a. char, but it’s fine as a Unicode code point.

I think this bug is not in the contract of push_code_point_unchecked (which is private anyway) but in its implementation, namely going through char. The fix is to either duplicate the logic of char::encode_utf8 into wtf8.rs, or (to avoid duplication) move that logic to a new function in libcore that takes a u32 parameter and that is called by both Wtf8::push_code_point_unchecked and char::encode_utf8. This function would need to be public because wtf8.rs is in a different crate, but it should be prema-unstable and #[doc(hidden)].

SimonSapin

SimonSapin commented on May 30, 2020

@SimonSapin
Contributor

I would classify the priority of this bug as low.

Although this call to char::from_u32_unchecked is UB, I expect Miri checking for it explicitly is the only case where that has any consequence in today’s implementation:

  • char::encode_utf8 behaves as expected (for the purpose of WTF-8) in the surrogate range and does not exploit this UB
  • The "layout" of char in rustc excludes values beyond char::MAX but not surrogates:

ty::Char => tcx.intern_layout(Layout::scalar(
self,
Scalar { value: Int(I32, false), valid_range: 0..=0x10FFFF },
)),

SimonSapin

SimonSapin commented on May 30, 2020

@SimonSapin
Contributor

By the way, the docs at https://doc.rust-lang.org/std/char/fn.from_u32_unchecked.html#safety feel insufficient:

Safety

This function is unsafe, as it may construct invalid char values.

For a safe version of this function, see the from_u32 function.

Which values are invalid?


Similarly, https://doc.rust-lang.org/std/char/index.html and https://doc.rust-lang.org/std/primitive.char.html say that char represents Unicode Scalar values, but I couldn’t find it documented anywhere that constructing an out of range char is Undefined Behavior rather than "merely" a logic bug.

Is this specified elsewhere?

What is the closest we have to a written down normative resource that specifies what is or isn’t UB in the the Rust language? (For APIs I assume this is the responsibility of their respective doc-comments.)

RalfJung

RalfJung commented on May 30, 2020

@RalfJung
MemberAuthor

Although this call to char::from_u32_unchecked is UB, I expect Miri checking for it explicitly is the only case where that has any consequence in today’s implementation

I agree the impact is low, but it's not just Miri -- this blocks #72683, which is how I discovered the problem.

What is the closest we have to a written down normative resource that specifies what is or isn’t UB in the the Rust language?

That would be https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html

RalfJung

RalfJung commented on May 30, 2020

@RalfJung
MemberAuthor

I think this bug is not in the contract of push_code_point_unchecked (which is private anyway) but in its implementation, namely going through char. The fix is to either duplicate the logic of char::encode_utf8 into wtf8.rs, or (to avoid duplication) move that logic to a new function in libcore that takes a u32 parameter and that is called by both Wtf8::push_code_point_unchecked and char::encode_utf8. This function would need to be public because wtf8.rs is in a different crate, but it should be prema-unstable and #[doc(hidden)].

But then what is "unchecked" about it? Elsewhere in that file the code is careful not to construct a char from a CodePoint without checking for surrogates:

pub fn to_char(&self) -> Option<char> {
match self.value {
0xD800..=0xDFFF => None,
_ => Some(unsafe { char::from_u32_unchecked(self.value) }),
}
}

That made me assume the author was aware that surrogates in char are UB.

SimonSapin

SimonSapin commented on May 30, 2020

@SimonSapin
Contributor

The Wtf8 type contains a (generalized-UTF-8-encoded) sequence of code point (including potentially surrogates) that doesn’t form a surrogate pair. push_code_point_unchecked doesn’t check for surrogate pairs.

CodePoint::to_char is a public method that is documented as returning a scalar value, therefore excluding all surrogates.

That made me assume the author was aware that surrogates in char are UB.

Ok I looked into it, there are three different people involved.

The original code in 2014 was YOLO transmute https://github.com/SimonSapin/rust-wtf8/blob/76e023dfd56eef27ce36108a0182c156ededde2e/src/lib.rs#L164-L171

Later in 2014, SimonSapin/rust-wtf8@8a42f9e moved to duplicating logic instead.

In 2015, PR #21488 which first imported WTF-8 support in libstd deduplicated that logic by adding a core::char::encode_utf8_raw function that takes u32. (I knew that approach sounded familiar…)

In 2016, PR #32204 changed the signature of char::encode_utf8 and in passing removed encode_utf8_raw. It made wtf8.rs use char::from_u32_unchecked + char::encode_utf8 instead, presumably in the middle of updating many callers other.

RalfJung

RalfJung commented on May 30, 2020

@RalfJung
MemberAuthor

Okay, so the fix would be to revert the part of #32204 that removed encode_utf8_raw?

SimonSapin

SimonSapin commented on May 30, 2020

@SimonSapin
Contributor

That or copy encode_utf8_raw into wtf8.rs. I find them both not great, so meh.

RalfJung

RalfJung commented on May 30, 2020

@RalfJung
MemberAuthor

Even with that done, there's still UB in another testcase:

use std::os::windows::ffi::{OsStrExt, OsStringExt};
use std::ffi::{OsStr, OsString};

fn main() {
    let mut base: Vec<u16> = OsStr::new("aé ").encode_wide().collect();
    base.push(0xD83D);
    let mut _res: Vec<u16> = OsString::from_wide(&base).encode_wide().collect();
}

This is converting something to a char before calling encode_utf16 -- so probably the same issue as the encode_utf8 above, just for a different encoding.

6 remaining items

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    C-bugCategory: This is a bug.I-unsoundIssue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/SoundnessO-windowsOperating system: WindowsT-libsRelevant to the library team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      Participants

      @SimonSapin@RalfJung@jonas-schievink@jyn514@JohnTitor

      Issue actions

        OsString::from_wide (in Windows OsStringExt) is unsound · Issue #72760 · rust-lang/rust