
beta 1.33 seems to break tarpaulin on multithreading #58104

Closed
Description

@elinorbgr (Contributor)

Consider the following minimal project:

Cargo.toml

[package]
name = "tarpotest"
version = "0.1.0"
authors = ["Me <me@me.me>"]
edition = "2018"

[dependencies]
futures-executor = "0.2.1"

src/lib.rs

#[test]
pub fn a() {
    futures_executor::ThreadPool::new();
}

#[test]
pub fn b() {
    futures_executor::ThreadPool::new();
}

Install tarpaulin from crates.io:

RUSTFLAGS="--cfg procmacro2_semver_exempt" cargo install cargo-tarpaulin

Then run it on this small project with either stable (rustc 1.32.0 (9fda7c223 2019-01-16)) or beta (rustc 1.33.0-beta.5 (1045131c1 2019-01-31)) rustc:

  • cargo +stable tarpaulin: works as expected
  • cargo +beta tarpaulin: occasionally (in roughly 15% of runs), tarpaulin errors out because the test executable segfaulted (see the driver sketch after this list):
Error: Failed to run tests! Error: A segfault occurred while executing tests
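
Since the failure is intermittent, a small driver makes it easier to reproduce. This is just a sketch: it shells out to cargo +beta tarpaulin in the project directory and counts failing runs; the run count of 50 is an arbitrary choice, not something measured.

use std::process::Command;

fn main() {
    let runs = 50; // arbitrary; at ~15% per run, a failure is very likely within 50 runs
    let mut failures = 0;
    for i in 0..runs {
        // Invoke tarpaulin on the beta toolchain, exactly as above.
        let status = Command::new("cargo")
            .args(&["+beta", "tarpaulin"])
            .status()
            .expect("failed to spawn cargo");
        if !status.success() {
            failures += 1;
            eprintln!("run {}: tarpaulin exited with {}", i, status);
        }
    }
    eprintln!("{}/{} runs failed", failures, runs);
}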

I have no idea what is at fault here, but apparently something changed in the latest beta of rustc that breaks the way tarpaulin does its instrumentation. In more detail, the segfault appears to occur only if at least 2 tests in the same binary involve spawning threads, and it seems more likely the more threads are spawned.
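
If thread spawning alone is what matters, the futures dependency may be incidental. A hypothetical variant of src/lib.rs using plain std::thread (untested here, an assumption on my part):

// Two tests that each spawn a few OS threads directly; the thread count of 4
// is arbitrary. Untested assumption: plain std::thread spawning triggers the
// same segfault under tarpaulin as the futures ThreadPool does.
#[test]
pub fn a() {
    let handles: Vec<_> = (0..4).map(|_| std::thread::spawn(|| ())).collect();
    for h in handles {
        h.join().unwrap();
    }
}

#[test]
pub fn b() {
    let handles: Vec<_> = (0..4).map(|_| std::thread::spawn(|| ())).collect();
    for h in handles {
        h.join().unwrap();
    }
}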

See xd009642/tarpaulin#190 for full details.

Activity

@pnkfelix (Member) commented on Feb 14, 2019

visiting for triage. P-high for initial investigation; assigning that task to @nagisa

@nagisa (Member) commented on Feb 17, 2019

This will be hard for me to bisect reliably, because alongside the SIGSEGV failures I also get

[ERROR tarpaulin] Failed to get test coverage! Error: Failed to run tests! Error: Error occurred when handling ptrace event: ESRCH: No such process

Bisection says that the SIGSEGV began with 2fba17f (#56837); cc @nikomatsakis @arielb1

The callstack looks like this:

#0  0x0000000000414e4f in futures_executor::thread_pool::ThreadPoolBuilder::create (self=0x7fffffffc170)
    at /home/nagisa/.cargo/registry/src/github.tiyicn.workers.dev-1ecc6299db9ec823/futures-executor-0.2.1/src/thread_pool.rs:262
#1  0x0000000000413d6f in futures_executor::thread_pool::ThreadPool::new ()
    at /home/nagisa/.cargo/registry/src/github.tiyicn.workers.dev-1ecc6299db9ec823/futures-executor-0.2.1/src/thread_pool.rs:78
#2  0x000000000040424d in tarpotest::b () at src/lib.rs:8
#3  0x0000000000404a4a in tarpotest::b::{{closure}} () at src/lib.rs:7
#4  0x00000000004047ae in core::ops::function::FnOnce::call_once ()
    at /rustc/7164a9f151a56316a382d8bc2b15ccf373e129ca/src/libcore/ops/function.rs:231
#5  0x000000000046105f in test::run_test::{{closure}} () at src/libtest/lib.rs:1475
#6  core::ops::function::FnOnce::call_once ()
    at /rustc/7164a9f151a56316a382d8bc2b15ccf373e129ca/src/libcore/ops/function.rs:231
#7  <F as alloc::boxed::FnBox<A>>::call_box ()
    at /rustc/7164a9f151a56316a382d8bc2b15ccf373e129ca/src/liballoc/boxed.rs:734
#8  0x00000000004cdbfa in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:92
#9  0x000000000047f188 in std::panicking::try ()
    at /rustc/7164a9f151a56316a382d8bc2b15ccf373e129ca/src/libstd/panicking.rs:276
#10 std::panic::catch_unwind () at /rustc/7164a9f151a56316a382d8bc2b15ccf373e129ca/src/libstd/panic.rs:388
#11 test::run_test::run_test_inner::{{closure}} () at src/libtest/lib.rs:1430
#12 0x000000000047eaad in test::run_test::run_test_inner () at src/libtest/lib.rs:1452
#13 0x000000000047d22c in test::run_test () at src/libtest/lib.rs:1471
#14 0x0000000000475c8d in test::run_tests () at src/libtest/lib.rs:1150
#15 test::run_tests_console () at src/libtest/lib.rs:957
#16 0x000000000046d764 in test::test_main () at src/libtest/lib.rs:290
#17 0x000000000046dfb2 in test::test_main_static () at src/libtest/lib.rs:324
#18 0x0000000000404276 in tarpotest::main ()
#19 0x0000000000404ad0 in std::rt::lang_start::{{closure}} ()
    at /rustc/7164a9f151a56316a382d8bc2b15ccf373e129ca/src/libstd/rt.rs:64
#20 0x00000000004bbfb3 in std::rt::lang_start_internal::{{closure}} () at src/libstd/rt.rs:49
#21 std::panicking::try::do_call () at src/libstd/panicking.rs:297
#22 0x00000000004cdbfa in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:92
#23 0x00000000004bcbc6 in std::panicking::try () at src/libstd/panicking.rs:276
#24 std::panic::catch_unwind () at src/libstd/panic.rs:388
#25 std::rt::lang_start_internal () at src/libstd/rt.rs:48
#26 0x0000000000404aa9 in std::rt::lang_start (main=0x404260 <tarpotest::main>, argc=2, argv=0x7fffffffdc08)
    at /rustc/7164a9f151a56316a382d8bc2b15ccf373e129ca/src/libstd/rt.rs:64
#27 0x00000000004042ad in main ()

I won’t be able to look into it much more at the moment.

@nagisa removed their assignment on Feb 17, 2019
@arielb1 (Contributor) commented on Feb 17, 2019

I suppose I'll have to investigate this then. Hope I can find enough time.

@pnkfelix (Member) commented on Feb 21, 2019

triage: speculatively assigning to @arielb1. Hopefully they will unassign themselves if they are unable to look into this.

@Mark-Simulacrum (Member) commented on Feb 21, 2019

Realistically I expect this to ship to stable as a regression, because we have ~4 days left before we need to produce the stable artifacts.

@arielb1 (Contributor) commented on Feb 21, 2019

I've had a busy stretch; I'll try to look at this soon.

@arielb1 (Contributor) commented on Feb 22, 2019

This appears to be triggered by the breakpoints tarpaulin places, which makes me suspect there is some race condition here that we somehow create in futures. This doesn't sound good.
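
For context: tarpaulin is a ptrace-based tracer that implements its coverage breakpoints by patching an int3 (0xcc) byte over the target's code. A minimal sketch of that cycle, simulated on a plain byte buffer rather than a real inferior (the real thing reads and writes the target's text via ptrace):

const INT3: u8 = 0xCC; // x86 breakpoint instruction

struct Breakpoint {
    addr: usize,
    saved: u8, // original code byte displaced by 0xCC
}

fn set(text: &mut [u8], addr: usize) -> Breakpoint {
    let saved = text[addr];
    text[addr] = INT3;
    Breakpoint { addr, saved }
}

// On SIGTRAP: restore the original byte, rewind the thread's RIP, single-step,
// then re-arm. In a multithreaded inferior this window is exactly where races
// can hide: another thread may run through `addr` while the byte is swapped
// in either direction.
fn step_over(text: &mut [u8], bp: &Breakpoint) {
    text[bp.addr] = bp.saved; // window opens: the real instruction is back
    // ... single-step the stopped thread here ...
    text[bp.addr] = INT3; // window closes: breakpoint re-armed
}

fn main() {
    // Stand-in code bytes: sub $0x18,%rsp (48 83 ec 18), mov %rsp,%rdi (48 89 e7).
    let mut text = vec![0x48, 0x83, 0xec, 0x18, 0x48, 0x89, 0xe7, 0xff];
    let bp = set(&mut text, 4);
    assert_eq!(text[4], INT3); // armed
    step_over(&mut text, &bp);
    assert_eq!(text[4], INT3); // re-armed after the single-step window
}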

@arielb1 (Contributor) commented on Feb 22, 2019

Looking at the backtrace

0x7fffffffcb40: 0x007d2210      0x00000000      0x00000001      0x00000000
0x7fffffffcb50: 0xffffcff0      0x00000000      0xffffcff0      0x00000000
0x7fffffffcb60: 0xffffcf90      0x00007fff      0x00000000      0x00000000
0x7fffffffcb70: 0x007d0910      0x00000000      0x00000000      0x00000000

   0x00000000004196ca <+90>:    mov    %rdi,0xb8(%rsp)  
   ...
   0x00000000004199c7 <+855>:   mov    0xb8(%rsp),%rcx                                                        
=> 0x00000000004199cf <+863>:   mov    %rax,0x8(%rcx)  

#0  0x00000000004199cf in futures_executor::thread_pool::ThreadPoolBuilder::create (self=0x7fffffffcf90)
    at /home/ubuntu/.cargo/registry/src/github.tiyicn.workers.dev-1ecc6299db9ec823/futures-executor-0.2.1/src/thread_pool.rs:262

(gdb) x/8x $rcx
0xffffcff0:     Cannot access memory at address 0xffffcff0

Something is smashing the stack of ThreadPoolBuilder::create (%rcx should contain the address of the return value, which is 0x7fffffffcff0, but it contains just 0xffffcff0).
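
The bad value is exactly the expected address truncated to 32 bits, which is what a 32-bit register move would leave behind; a one-line check:

fn main() {
    let expected: u64 = 0x7fff_ffff_cff0; // address %rcx should hold
    let observed: u64 = 0xffff_cff0; // what it actually held
    // The observed value is the expected pointer truncated to 32 bits.
    assert_eq!(expected as u32 as u64, observed);
}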

@arielb1 (Contributor) commented on Feb 22, 2019

This is getting weird. The %rdi passed to the call is not pointing to the correct place???

A similar crash: one where the stack is not 16-byte aligned, apparently starting from tarpotest::b::{{closure}}, leading to movaps crashes (movaps faults on memory operands that are not 16-byte aligned; how is this happening?).

@arielb1 (Contributor) commented on Feb 22, 2019

I smell a bug in how tarpaulin itself handles interrupts from concurrent threads.

Dump of assembler code for function tarpotest::b:
   0x0000000000403420 <+0>:     sub    $0x18,%rsp
   0x0000000000403424 <+4>:     int3   
   0x0000000000403425 <+5>:     mov    %esp,%edi -- looks suspicious, given that %rdi is truncated
   0x0000000000403427 <+7>:     callq  *0x3c7cdb(%rip)        # 0x7cb108
=> 0x000000000040342d <+13>:    mov    %rsp,%rdi
   0x0000000000403430 <+16>:    callq  0x4036b0 <core::ptr::real_drop_in_place>
   0x0000000000403435 <+21>:    add    $0x18,%rsp
   0x0000000000403439 <+25>:    retq   

Dump of assembler code for function tarpotest::b::{{closure}}:
   0x0000000000403860 <+0>:     push   %rax
   0x0000000000403861 <+1>:     mov    %rdi,(%rsp)
   0x0000000000403865 <+5>:     callq  0x403420 <tarpotest::b>
   0x000000000040386a <+10>:    callq  0x403880 <test::assert_test_result>
   0x000000000040386f <+15>:    pop    %rax
   0x0000000000403870 <+16>:    retq   
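
Decoding the two words that tarpaulin's trace (further down) logs for [403420] (ffe789cc18ec8348 with the breakpoint armed, ffe7894818ec8348 restored) shows where that truncated mov comes from; the annotations are mine:

fn main() {
    let patched: u64 = 0xffe789cc18ec8348; // word at 0x403420, breakpoint armed
    let original: u64 = 0xffe7894818ec8348; // same word with the original byte

    // x86 code bytes are the little-endian byte order of these words.
    assert_eq!(patched.to_le_bytes(), [0x48, 0x83, 0xec, 0x18, 0xcc, 0x89, 0xe7, 0xff]);
    assert_eq!(original.to_le_bytes(), [0x48, 0x83, 0xec, 0x18, 0x48, 0x89, 0xe7, 0xff]);

    // Original: 48 83 ec 18 = sub $0x18,%rsp
    //           48 89 e7    = mov %rsp,%rdi (0x48 is the REX.W prefix)
    // Patched:  the 0xcc (int3) overwrites that REX.W prefix, so a thread
    //           resuming at +5 executes the leftover 89 e7 = mov %esp,%edi,
    //           truncating the pointer to 32 bits. That would line up with
    //           the 0xffffcff0 seen in %rcx above, though it does not by
    //           itself prove that is the crash mechanism.
}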

@arielb1 (Contributor) commented on Feb 22, 2019

It appears that in bad traces the SIGTRAP for the second breakpoint (at 0x403820) is somehow never received, despite that byte being an 0xcc. The second breakpoint comes only about 5 instructions after the first one (at 0x403860), which feels weird and hard to explain as a miscompilation.

bad trace:

wait status=Stopped(Pid(16463), SIGTRAP)
%rip=403861, at [403420] is ffe789cc18ec8348, rsp=7fffffffd018 rdi=7fffffffd028
wait status=PtraceEvent(Pid(16465), SIGTRAP, 6)
wait status=Stopped(Pid(16463), SIGTRAP)                                                                                                                                                                     
%rip=403861, at [403420] is ffe789cc18ec8348, rsp=7fffffffd010 rdi=7fffffffd028
writing fbb6e8243c894850>fbb6e8243c894850 to [403860]                         
wait status=Exited(Pid(16465), 0)                                                                   
wait status=PtraceEvent(Pid(16467), SIGTRAP, 6)                              
wait status=PtraceEvent(Pid(16468), SIGTRAP, 6)     
wait status=Exited(Pid(16467), 0)                                            
wait status=Exited(Pid(16468), 0)                                      
wait status=PtraceEvent(Pid(16469), SIGTRAP, 6)                              
wait status=PtraceEvent(Pid(16466), SIGTRAP, 6)  
wait status=PtraceEvent(Pid(16470), SIGTRAP, 6)                               
wait status=PtraceEvent(Pid(16464), SIGTRAP, 6)        
wait status=Exited(Pid(16469), 0)                                              
wait status=PtraceEvent(Pid(16471), SIGTRAP, 6)               
wait status=Exited(Pid(16470), 0)                                         
wait status=Exited(Pid(16471), 0)          
wait status=Exited(Pid(16466), 0)       
wait status=Exited(Pid(16464), 0)
wait status=PtraceEvent(Pid(16463), SIGTRAP, 3)                                      
wait status=Stopped(Pid(16473), SIGSTOP)         
wait status=PtraceEvent(Pid(16463), SIGTRAP, 3)     
wait status=Stopped(Pid(16474), SIGSTOP)        
wait status=Stopped(Pid(16472), SIGSTOP)                                
wait status=PtraceEvent(Pid(16463), SIGTRAP, 3)                                      
wait status=Stopped(Pid(16475), SIGSTOP)                                       
wait status=PtraceEvent(Pid(16463), SIGTRAP, 3)  
wait status=Stopped(Pid(16476), SIGSTOP)                                     
wait status=PtraceEvent(Pid(16463), SIGTRAP, 3)                   
wait status=PtraceEvent(Pid(16463), SIGTRAP, 3)                           
wait status=Stopped(Pid(16478), SIGSTOP)
wait status=PtraceEvent(Pid(16463), SIGTRAP, 3)
wait status=Stopped(Pid(16479), SIGSTOP)            
wait status=Stopped(Pid(16463), SIGSEGV)       
[ERROR tarpaulin] Failed to get test coverage! Error: Failed to run tests! Error: A segfault occurred while executing tests

good trace:

wait status=Stopped(Pid(16438), SIGTRAP)
breakpoint at rip=403861
%rip=403861, at [403420] is ffe789cc18ec8348, rsp=7fffffffd018 rdi=7fffffffd028
writing fbb6e8243c8948cc>fbb6e8243c894850 to [403860]                                                                                                                                                         
wait status=PtraceEvent(Pid(16440), SIGTRAP, 6)                                                                                                                                                              
wait status=Stopped(Pid(16438), SIGTRAP)                                                                                                                                                                     
breakpoint at rip=403861                                                                                                                                                                                     
%rip=403861, at [403420] is ffe789cc18ec8348, rsp=7fffffffd010 rdi=7fffffffd028                                                                                                                                                                             
writing fbb6e8243c894850>fbb6e8243c894850 to [403860]
wait status=PtraceEvent(Pid(16442), SIGTRAP, 6)                                                                                                                                                              
wait status=Stopped(Pid(16438), SIGTRAP)                                                                                                                                                                     
breakpoint at rip=403425                                                                                                                                                                                     
%rip=403425, at [403420] is ffe789cc18ec8348, rsp=7fffffffcff0 rdi=7fffffffd028                                                                                          
writing ffe789cc18ec8348>ffe7894818ec8348 to [403420]                                                                                                                                                         
wait status=Stopped(Pid(16438), SIGTRAP)                                                                                                                                                                     
breakpoint at rip=403427
...
@arielb1 (Contributor) commented on Feb 22, 2019

Reproduction Status

I can reproduce the segfault on 244b05d, which is before 2fba17f (#56837), meaning the bisection was wrong. I used tarpaulin f95f717222e708f6bf0f7c5d2ee0b8f63a9f5c7e (from Jan 21).

This also doesn't feel to me quite like a miscompilation bug, but rather like some ptrace oddity in tarpaulin. It would be nice if someone who understands ptrace could look at this.

@nagisa (Member) commented on Feb 23, 2019

@arielb1 weird, I bisected twice across bors merges, running the reproducer 32 times at each step, and it pointed at the same commit both times. And I did think that PR looked very unlikely to be the actual cause.

I bisected again, twice, this time using prebuilt nightlies and running the reproduction 64 times at each step; it points at the range 2018-12-08 to 2018-12-14, well before your PR landed. Sorry for the false alarm!

I will run bisection on merges between 2018-12-08 and 2018-12-14 now.
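
As a sanity check on those run counts (assuming independent runs and the ~15% per-run failure rate from the original report), the chance that a bad commit passes every run at a single bisection step:

fn main() {
    let fail_rate = 0.15_f64; // per-run segfault rate reported above
    for &runs in &[32, 64] {
        // Probability that a bad commit shows no failure in `runs` runs,
        // i.e. that this bisection step mislabels the commit as good.
        let false_negative = (1.0 - fail_rate).powi(runs);
        println!("{} runs: false-negative probability {:.2e}", runs, false_negative);
    }
}

At 32 runs per step that is about 0.5%, which compounded over a dozen steps makes an occasional wrong bisection result plausible; at 64 runs it drops to about 0.003%.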

@nagisa (Member) commented on Feb 23, 2019

Bisection hit 3499575 (#56243). That change seems significantly more relevant, and it is also something that has broken multiple things in the past, mostly by exposing invalid assumptions in users' code, which matches @arielb1's investigation results and instincts.

@arielb1 (Contributor) commented on Feb 23, 2019

In that case, there is probably nothing actionable on our side.

@pnkfelix (Member) commented on Feb 28, 2019

Nominating for discussion at the T-compiler meeting. My predisposition is to close this bug as non-actionable.

@pnkfelix (Member) commented on Apr 4, 2019

Closing as non-actionable.


Metadata

Labels: P-high (High priority), T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue), regression-from-stable-to-beta (Performance or correctness regression from stable to beta)


Participants: @pnkfelix, @nagisa, @arielb1, @elinorbgr, @Mark-Simulacrum
