Description
Rust implements its own userspace "stack guard", whose purpose is to print a nice error message about "stack overflow" rather than segfaulting. It is applied to each new thread, as well as the main thread:
https://github.com/rust-lang/rust/blob/master/src/libstd/rt.rs#L44
https://github.com/rust-lang/rust/blob/master/src/libstd/sys/unix/thread.rs#L248
https://github.com/rust-lang/rust/blob/master/src/libstd/sys/unix/thread.rs#L229
The start address of the stack is calculated effectively via pthread_attr_getstack (pthread_getattr_np (pthread_self))
This is where the problems occur. pthread_getattr_np is not defined by POSIX, and the manual page does not specify exactly how it behaves when getting the attributes of the main thread. It works fine for non-main threads because they are usually created with a fixed-size stack which does not automatically expand, so the start address is pre-determined on those threads.
However, on Linux (and other systems) the real size of the main stack is not determined in advance; it starts off small and then gets automatically expanded, via the process described in great detail in recent articles on the stack-clash bug. In practice, with glibc the above series of function calls returns top-of-stack - stack-rlimit when called on the main thread. The rlimit is 8MB by default on most machines I've seen.
However, most of the space in between is not allocated at the start of the program. For example, a test "Hello World" Rust program has this (after init):
3fffff800000-3fffff810000 ---p 00000000 00:00 0
3ffffffd0000-400000000000 rw-p 00000000 00:00 0 [stack]
with ulimit -s unlimited it looks like this:
1000002d0000-1000002e0000 ---p 00000000 00:00 0
3ffffffd0000-400000000000 rw-p 00000000 00:00 0 [stack]
OTOH, the Linux stack guard is not a physical guard page but just extra logic that prevents the stack from growing too close to another mmap allocation. If I understand correctly: contrary to the get_stack_start function, it does not work from the stack rlimit, because that could be unlimited. Instead, it works from the real size of the existing allocated stack. The guard then ensures that the next-highest mapped page remains more than stack_guard_gap below the lowest stack address, and if not, it will trigger a segfault.
On ppc64el Debian and other systems (Fedora aarch64, Fedora ppc64be, etc) the page size is 64KB. Previously, stack_guard_gap was equal to PAGESIZE. Now, it is 256 * PAGESIZE = 16MB, compared to the default stack size limit of 8MB. So now when Linux tries to expand the stack, it sees that the stack is only (8MB - $existing-size) away from the next-highest mmapped page (Rust's own stack guard) which is smaller than stack_guard_gap (16MB) and so it segfaults.
The logic only "didn't fail" before because stack_guard_gap was much smaller than the default stack rlimit. But even then, it could not have served its intended purpose of detecting a stack overflow, since the kernel's stack-guard logic would have caused a segfault before the real stack ever expanded into Rust's own stack guard.
In case my words aren't the best, here is a nice diagram instead:
-16MB-x     -16MB     -8MB         x       top
   |          |         |          |        |
-----------------------[A]---------[<-- stack]
   G                               S
[..] are mapped pages. Now, Linux's stack guard will segfault if anything is mapped between G and S. For Rust, its own stack guard page at A causes this. Previously, the G-S gap was much smaller, and A was lower than G.
AIUI, Linux developers are currently discussing the best way to "unbreak" programs that do this - they try not to break userspace. But it is nevertheless incorrect behaviour for Rust to do this anyway, and a better way of doing it should be found. Unfortunately I don't have any better ideas at the moment, since the very notion of a "stack start address" for main threads is apparently not pinned down by POSIX or other standards, leading to this sort of misunderstanding between kernel and userspace about where the "start" really is.
I'm not sure about the details of other systems, but if they implement stack guards the way Linux does (i.e. against the real allocated stack rather than against a stack rlimit), and the relevant numbers line up like they do above, then Rust's main stack guard would also cause problems there.
This causes rust-lang/cargo#4197
Activity
infinity0 changed the title from "Rust's main thread guard stack is implemented incorrectly; breaks nearly all Rust programs on 64KB-pagesize post-stack-clash Linux" to "Rust's main-stack thread guard is implemented incorrectly; breaks nearly all Rust programs on 64KB-pagesize post-stack-clash Linux"

infinity0 commented on Jul 4, 2017
Some test code in case someone wants to play with it themselves on a 64KB-pagesize system:
jonas-schievink commented on Jul 4, 2017
Just to clarify: Does this mean that this mechanism has always caused a segfault when the stack grew to the page just before (or after, since it grows backwards) the guard page? This would mean that Rust's guard page was never hit on Linux.
infinity0 commented on Jul 4, 2017
It seems so; you can set up a similar test even on amd64 (4KB pagesize) pre-stack-clash with

fn main() { println!("Hello world!"); }

and sh -c 'ulimit -s 24; ./test'. If I set this to 20 or 16 I start seeing occasional segfaults, but I never see any "stack overflow" message from Rust. Non-main threads should be fine, though.

cuviper commented on Jul 4, 2017
That may just be because you're setting the amd64 stack too small for even the init calls to work. I can get the message from the main thread trivially with recursion, on an older kernel:
But on an x86_64 stack-clash-patched kernel, that just segfaults.
So I guess we can say that stack-clash kernels make the guard page useless with 4KB pages, and actively harmful with 64KB pages. I think it's reasonable to blame the kernel's behavior, though.
cuviper commented on Jul 4, 2017
cc @alexcrichton and #42816. While stack probing is currently x86-only, I think it will also have the effect of hitting the kernel's safety margin before anything will reach the custom guard page.
infinity0 commented on Jul 4, 2017
I can confirm that your test program prints the stack overflow message, but it seems that Rust's guard page is still not getting hit; the segfault gets raised and caught by Rust without hitting Rust's guard page:
I wonder why this doesn't happen on the post-stack-clash kernels, though...
cuviper commented on Jul 4, 2017
Ah, I think it's because the init ends up using an offset two pages above its base, and then the signal handler looks for faults only one page below that. So in fact it's never looking for the real guard page, but the one above it that the kernel triggers itself. And in post-stack-clash kernels, the kernel now triggers SIGSEGV many pages above, so it doesn't even match that offset=2 any more.
infinity0 commented on Jul 4, 2017
Ahhh, OK. So it sounds like whoever originally wrote this code knew about the actual Linux stack-guard behaviour described above, and hardcoded these offsets appropriately; they just didn't expect it to change from kernel to kernel. If there were a way to get the kernel's stack-guard size at runtime, Rust's stack guard could actually be made to work without additional kernel patches; we would just need to adjust the offsets in the two pieces of code that you linked. (edit: Though the behaviour in the case of an unlimited stack would not be very reliable.)
cuviper commented on Jul 4, 2017
I think at least on Linux, we might be better not to add our own guard page to the main thread at all. We only need a way to determine if a particular fault was stack-related for reporting purposes.
infinity0 commented on Jul 4, 2017
@aturon It looks like you wrote the original code in 2b3477d, do you have any comments?
alexcrichton commented on Jul 4, 2017
Hm, I'm a little confused and don't quite understand the OP. It makes sense to me that the main thread has "special stack growth functionality" which manually mapping some page in between messes with. What I don't understand is how this relates to page size, or what a "post-stack-clash kernel" means. What was changed in the kernel? This is just a problem of error reporting, right?
@cuviper note that #42816 can't land right now due to a test failing on Travis but it's passing locally for me (specifically a test that the main thread stack overflowing prints a message). I wonder if that's related to this?
Is there a recourse for us to deterministically catch stack overflow on the main thread? Or can we only do that for child threads?
tavianator commented on Jul 4, 2017
@alexcrichton "Stack Clash" is a series of vulnerabilities reported recently in a bunch of different projects. Ultimately they came down to triggering some stack overflow vulnerabilities to jump over the stack guard page directly into the heap, resulting in exploits rather than just crashes.
A mitigation was added to recent Linux kernels that significantly widens the size of the stack guard, making this exploitation technique much more difficult. The new size of the guard is 256 pages instead of 1 page, hence the relation to the page size (though Linus recently suggested a hard-coded 1MiB gap instead of a page-size-dependent gap).
I don't think this is just a problem of error reporting. The issue is that if the main thread stack ever needs to grow, even if there is room above Rust's "manual" guard page, there is no longer room for the stack plus the kernel's new gigantic guard area. This results in a stack overflow segfault for any stack usage above the initially mapped area, well below the expected limit of 8MiB.
cuviper commented on Jul 4, 2017
I'll try to restate it.
When a program starts, only a relatively small stack is actually mapped to begin with. If you page-fault past the end, the kernel will dynamically add more pages up to the rlimit. But if a new stack page would end up right on top of some existing mapping, the kernel will refuse and you'll get a SIGSEGV. Nobody wants the stack to seamlessly flow into an unrelated mapping, after all.
Rust is adding a PROT_NONE guard page at the "bottom" of the rlimited stack. As far as the kernel is concerned, this is an unrelated mapping, so now it starts enforcing that the stack can't grow to a point that would neighbor that mapping. So with our page in place, the kernel will fault when the stack reaches the page above it. That's why our handler has to use offset=2 to see whether a faulting address is a stack overflow.

The new kernel behavior is that instead of enforcing just a one-page gap between stack growth and other mappings, the kernel now enforces 256 pages. So now our guard page causes the kernel to fault any access in any of the 256 pages above it. We do get the right SIGSEGV, but our handler doesn't recognize it as a stack overflow. (I guess it still would if you happened to jump right to the page above our guard.)

On 4KB systems, 256 pages is 1MB, so the placement of our guard page makes the last 1MB of stack unusable. On 64KB systems, 256 pages makes 16MB unusable. With the rlimit typically only 8MB, our guard page makes it so the stack effectively can't grow at all.
And yes, it does seem likely that this is why #42816 is getting raw SIGSEGVs instead of reporting the stack overflow nicely. I'd guess your own kernel does not have these updates yet, and CI does.
alexcrichton commented on Jul 4, 2017
Thanks for the info @cuviper and @tavianator! That all makes sense to me.
Sounds like detection of a stack overflow on the main thread right now basically isn't possible? It seems to me that we should remove the main thread guard page on Linux (as the kernel already has it anyway) and rely on #42816 to guarantee for sure we hit it. Does that sound right?