-
Notifications
You must be signed in to change notification settings - Fork 33
Closed
Description
We're using Net::IMAP
via the MailRoom gem, and quite frequently we are seeing issues with the process not terminating even though we attempt to run Thread#join
with a 60-second timeout.
A GDB backtrace shows that this is stuck waiting for a conditional variable:
Thread 2 (Thread 0x7f2bf81ff700 (LWP 11)):
#0 futex_wait_cancelable (private=0, expected=0, futex_word=0x7f2bfa3df810) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x7f2bfa3df840, cond=0x7f2bfa3df7e8) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=cond@entry=0x7f2bfa3df7e8, mutex=mutex@entry=0x7f2bfa3df840) at pthread_cond_wait.c:655
#3 0x00007f2bffe343fd in rb_native_cond_wait (mutex=0x7f2bfa3df840, cond=0x7f2bfa3df7e8) at thread_pthread.c:1197
#4 native_cond_sleep (th=th@entry=0x7f2bfa3df780, rel=rel@entry=0x0) at thread_pthread.c:1197
#5 0x00007f2bffe346f4 in native_sleep (th=th@entry=0x7f2bfa3df780, rel=0x0) at thread_pthread.c:2113
#6 0x00007f2bffe376de in sleep_forever (fl=1, th=0x7f2bfa3df780) at thread.c:1252
#7 rb_thread_sleep_deadly_allow_spurious_wakeup () at thread.c:1343
#8 rb_mutex_sleep_forever (time=time@entry=8) at thread_sync.c:451
#9 0x00007f2bffcce134 in rb_ensure (b_proc=b_proc@entry=0x7f2bffe37660 <rb_mutex_sleep_forever>, data1=data1@entry=8, e_proc=e_proc@entry=0x7f2bffe36160 <mutex_lock_uninterruptible>,
data2=data2@entry=139826840321280) at eval.c:1129
#10 0x00007f2bffe361bd in rb_mutex_sleep (self=139826840321280, timeout=<optimized out>) at thread_sync.c:477
#11 0x00007f2bffe796e1 in vm_call0_cfunc_with_frame (argv=0x7f2bf80fc858, cd=<optimized out>, calling=0x7f2bf80fc700, ec=0x7f2bfa209bd0) at vm_eval.c:91
#12 vm_call0_cfunc (argv=0x7f2bf80fc858, cd=<optimized out>, calling=0x7f2bf80fc700, ec=0x7f2bfa209bd0) at vm_eval.c:105
#13 vm_call0_body (ec=0x7f2bfa209bd0, calling=0x7f2bf80fc700, cd=<optimized out>, argv=0x7f2bf80fc858) at vm_eval.c:140
#14 0x00007f2bffe7bfd8 in rb_funcallv_with_cc (cd=0x7f2bfffa75c0 <rb_funcallv_data>, recv=139826840321280, mid=24193, argc=1, argv=<optimized out>) at vm_core.h:1805
#15 0x00007f2bffcce134 in rb_ensure (b_proc=b_proc@entry=0x7f2bffe2f360 <do_sleep>, data1=data1@entry=139826822105168, e_proc=e_proc@entry=0x7f2bffe2df60 <delete_from_waitq>, data2=data2@entry=139826822105184)
at eval.c:1129
#16 0x00007f2bffe2f472 in rb_condvar_wait (argc=<optimized out>, argv=0x7f2bf80fcad0, self=139826858411400) at thread_sync.c:1407
#17 0x00007f2bffe796e1 in vm_call0_cfunc_with_frame (argv=0x7f2bf80fcad0, cd=<optimized out>, calling=0x7f2bf80fc980, ec=0x7f2bfa209bd0) at vm_eval.c:91
#18 vm_call0_cfunc (argv=0x7f2bf80fcad0, cd=<optimized out>, calling=0x7f2bf80fc980, ec=0x7f2bfa209bd0) at vm_eval.c:105
#19 vm_call0_body (ec=0x7f2bfa209bd0, calling=0x7f2bf80fc980, cd=<optimized out>, argv=0x7f2bf80fcad0) at vm_eval.c:140
#20 0x00007f2bffe79d5f in rb_vm_call0 (ec=ec@entry=0x7f2bfa209bd0, recv=recv@entry=139826858411400, id=id@entry=24353, argc=<optimized out>, argv=<optimized out>, me=me@entry=0x7f2bfef8b9d0, kw_splat=0)
at vm_eval.c:52
#21 0x00007f2bffe79ff8 in rb_vm_call_kw (ec=0x7f2bfa209bd0, recv=139826858411400, id=24353, argc=<optimized out>, argc@entry=2, argv=<optimized out>, argv@entry=0x7f2bf80fcab0, me=0x7f2bfef8b9d0,
kw_splat=<optimized out>) at vm_eval.c:268
#22 0x00007f2bffe7a453 in rb_call0 (ec=<optimized out>, recv=<optimized out>, mid=<optimized out>, argc=argc@entry=2, argv=argv@entry=0x7f2bf80fcab0, call_scope=call_scope@entry=CALL_FCALL,
self=<optimized out>) at vm_eval.c:392
#23 0x00007f2bffe7ac7e in rb_call (scope=CALL_FCALL, argv=0x7f2bf80fcab0, argc=2, mid=<optimized out>, recv=<optimized out>) at vm_eval.c:718
#24 rb_funcallv (recv=<optimized out>, mid=<optimized out>, argc=argc@entry=2, argv=argv@entry=0x7f2bf80fcad0) at vm_eval.c:958
#25 0x00007f2bff2f62e1 in monitor_wait_for_cond_body (v=v@entry=139826822106112) at monitor.c:152
#26 0x00007f2bffcce134 in rb_ensure (b_proc=b_proc@entry=0x7f2bff2f62a0 <monitor_wait_for_cond_body>, data1=data1@entry=139826822106112, e_proc=e_proc@entry=0x7f2bff2f64a0 <monitor_enter_for_cond>,
data2=data2@entry=139826822106112) at eval.c:1129
#27 0x00007f2bff2f641e in monitor_wait_for_cond (monitor=139826840321320, cond=<optimized out>, timeout=8) at monitor.c:180
#28 0x00007f2bffe66c59 in vm_call_cfunc_with_frame (empty_kw_splat=<optimized out>, cd=0x7f2bfe9c8b70, calling=<optimized out>, reg_cfp=0x7f2bf81fd268, ec=0x7f2bfa209bd0) at vm_insnhelper.c:2514
#29 vm_call_cfunc (ec=0x7f2bfa209bd0, reg_cfp=0x7f2bf81fd268, calling=<optimized out>, cd=0x7f2bfe9c8b70) at vm_insnhelper.c:2539
#30 0x00007f2bffe72092 in vm_sendish (block_handler=<optimized out>, method_explorer=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_insnhelper.c:4023
#31 vm_exec_core (ec=0x7f2bfa3df810, initial=128) at insns.def:801
#32 0x00007f2bffe78365 in rb_vm_exec (ec=0x7f2bfa209bd0, mjit_enable_p=1) at vm.c:1929
#33 0x00007f2bffe82130 in invoke_block (captured=0x7f2bf81fd3d0, opt_pc=<optimized out>, type=<optimized out>, cref=<optimized out>, self=139826840321400, iseq=0x7f2bfef91600, ec=0x7f2bfa209bd0) at vm.c:1044
#34 invoke_iseq_block_from_c (me=0x0, is_lambda=0, cref=<optimized out>, passed_block_handler=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=0, self=139826840321400,
captured=0x7f2bf81fd3d0, ec=0x7f2bfa209bd0) at vm.c:1116
#35 invoke_block_from_c_bh (argc=<optimized out>, argv=<optimized out>, kw_splat=<optimized out>, passed_block_handler=<optimized out>, cref=<optimized out>, is_lambda=<optimized out>,
force_blockarg=<optimized out>, block_handler=<optimized out>, ec=<optimized out>) at vm.c:1134
#36 vm_yield (argc=<optimized out>, argv=<optimized out>, kw_splat=<optimized out>, ec=<optimized out>) at vm.c:1179
#37 rb_yield_0 (argv=<optimized out>, argc=<optimized out>) at vm_eval.c:1227
#38 rb_yield_values (n=<optimized out>) at vm_eval.c:1252
#39 0x00007f2bffe82eef in vm_yield (kw_splat=0, argv=<optimized out>, argc=<optimized out>, ec=<optimized out>) at vm.c:1165
#40 rb_yield_0 (argv=<optimized out>, argc=<optimized out>) at vm_eval.c:1227
#41 rb_yield_values2 (argc=<optimized out>, argv=<optimized out>) at vm_eval.c:1273
#42 0x00007f2bffcce134 in rb_ensure (b_proc=0x7f2bff2f6380 <monitor_sync_body>, data1=139826840321320, e_proc=<optimized out>, data2=<optimized out>) at eval.c:1129
#43 0x00007f2bffe66c59 in vm_call_cfunc_with_frame (empty_kw_splat=<optimized out>, cd=0x7f2bfa4f19e0, calling=<optimized out>, reg_cfp=0x7f2bf81fd380, ec=0x7f2bfa209bd0) at vm_insnhelper.c:2514
#44 vm_call_cfunc (ec=0x7f2bfa209bd0, reg_cfp=0x7f2bf81fd380, calling=<optimized out>, cd=0x7f2bfa4f19e0) at vm_insnhelper.c:2539
#45 0x00007f2bffe72150 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_insnhelper.c:4023
#46 vm_exec_core (ec=0x7f2bfa3df810, initial=128) at insns.def:782
#47 0x00007f2bffe78365 in rb_vm_exec (ec=0x7f2bfa209bd0, mjit_enable_p=1) at vm.c:1929
#48 0x00007f2bffe792bc in invoke_iseq_block_from_c (me=0x0, is_lambda=<optimized out>, cref=0x0, passed_block_handler=0, kw_splat=<optimized out>, argv=<optimized out>, argc=0, self=139826857723800,
captured=<optimized out>, ec=0x7f2bfa209bd0) at vm.c:1116
#49 invoke_block_from_c_proc (me=0x0, is_lambda=<optimized out>, passed_block_handler=0, kw_splat=<optimized out>, argv=<optimized out>, argc=0, self=139826857723800, proc=0x7f2bfa209bd0, ec=0x7f2bfa209bd0)
at vm.c:1216
#50 vm_invoke_proc (passed_block_handler=0, kw_splat=<optimized out>, argv=<optimized out>, argc=0, self=139826857723800, proc=0x7f2bfa209bd0, ec=0x7f2bfa209bd0) at vm.c:1238
#51 rb_vm_invoke_proc (ec=0x7f2bfa209bd0, proc=proc@entry=0x7f2bf98a4510, argc=0, argv=<optimized out>, kw_splat=<optimized out>, passed_block_handler=passed_block_handler@entry=0) at vm.c:1259
#52 0x00007f2bffe3646d in thread_do_start (th=0x7f2bfa3df780) at thread.c:697
#53 0x00007f2bffe384ef in thread_start_func_2 (th=0x7f2bfa3df780, stack_start=<optimized out>) at thread.c:745
#54 0x00007f2bffe38a34 in thread_start_func_1 (th_ptr=<optimized out>) at thread_pthread.c:969
#55 0x00007f2bff9c5fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#56 0x00007f2bff4574cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 1 (Thread 0x7f2bff33ff80 (LWP 7)):
#0 0x00007f2bff44c916 in __GI_ppoll (fds=fds@entry=0x7fff52ea2f78, nfds=nfds@entry=1, timeout=<optimized out>, timeout@entry=0x7fff52ea2f80, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1 0x00007f2bffe33880 in rb_sigwait_sleep (th=th@entry=0x7f2bfee2d000, sigwait_fd=sigwait_fd@entry=3, rel=rel@entry=0x7fff52ea3088) at hrtime.h:148
#2 0x00007f2bffe34589 in native_sleep (th=0x7f2bfee2d000, rel=0x7fff52ea3088) at thread_pthread.c:2099
#3 0x00007f2bffe37b64 in rb_thread_terminate_all () at thread.c:587
#4 0x00007f2bffccfbaa in rb_ec_cleanup (ec=ec@entry=0x7f2bfee09050, ex=<optimized out>) at eval.c:218
#5 0x00007f2bffccfed3 in ruby_run_node (n=0x7f2bfa503a98) at eval.c:336
#6 0x000055dabf0fe10b in main (argc=<optimized out>, argv=<optimized out>) at ./main.c:50
Net::IMAP
uses several conditional variables:
idle
: https://github.com/ruby/ruby/blob/48f324e92f9b36edc267f9871e35039cbd1c2eb9/lib/net/imap.rb#L965get_tagged_response
: https://github.com/ruby/ruby/blob/48f324e92f9b36edc267f9871e35039cbd1c2eb9/lib/net/imap.rb#L1215send_literal
: https://github.com/ruby/ruby/blob/48f324e92f9b36edc267f9871e35039cbd1c2eb9/lib/net/imap.rb#L1368
We're using a 60-second idle timeout for idle
. However, the last two do NOT have a timeout, so it's possible we're getting stuck in one of those cases.
I noticed that send_literal
only checks the state of @exception
after the wait returns. Do we need to do this?
diff --git a/lib/net/imap.rb b/lib/net/imap.rb
index 720acbc86d..2e1fd6e7b9 100644
--- a/lib/net/imap.rb
+++ b/lib/net/imap.rb
@@ -1365,6 +1365,7 @@ def send_literal(str, tag = nil)
@continued_command_tag = tag
@continuation_request_exception = nil
begin
+ raise @exception if @exception
@continuation_request_arrival.wait
e = @continuation_request_exception || @exception
raise e if e
I also wonder if we need a timeout for these other condition variables.
Metadata
Metadata
Assignees
Labels
No labels
Type
Projects
Milestone
Relationships
Development
Select code repository
Activity
engwan commentedon Mar 5, 2021
This is a sample exception from our logs that led to the process not terminating:
Looking at the code, this would have been rescued in https://github.com/ruby/ruby/blob/c1af7b1e1d408f9796a5f46c9ed36bc5adea4aa2/lib/net/imap.rb#L1148
And both conditional variables would have been woken by https://github.com/ruby/ruby/blob/c1af7b1e1d408f9796a5f46c9ed36bc5adea4aa2/lib/net/imap.rb#L1203-L1204
engwan commentedon Mar 5, 2021
I see what you mean here. I guess it's possible where the receiver thread rescues the Exception, sets
@exception
, calls#broadcast
, and then#wait
gets called after.stanhu commentedon Mar 5, 2021
As far as I can tell,
send_literal
doesn't seem to be called in our case, so I wonder if it's one the two other condition variables that's getting stuck.I can't seem to reproduce this locally. The only commonality I've found is that this has been reported when running MailRoom in Google Kubernetes Engine (GKE).
Add liveness health check support
shugo commentedon Mar 11, 2021
@stanhu Could you show Ruby-level backtrace using rb_ps defined in .gdbinit bundled with Ruby?
gdbdump-ruby may be useful.
stanhu commentedon Mar 15, 2021
@shugo A customer sent us this:
This does point to the
wait
inget_tagged_response
:net-imap/lib/net/imap.rb
Lines 1297 to 1301 in c04bf8f
Although is this output only showing one thread?
stanhu commentedon Mar 15, 2021
It appears the deadlock is in the
ensure
block after theidle
:net-imap/lib/net/imap.rb
Lines 1055 to 1060 in c04bf8f
stanhu commentedon Mar 15, 2021
This does suggest that this
synchronize
block never ran:net-imap/lib/net/imap.rb
Lines 1287 to 1294 in c04bf8f
If it did,
@receiver_thread_terminating
would have been set totrue
, thus preventing thatensure
block from running.It's possible the
synchronize
block was trying to wait forget_tagged_response
to finish, though. Maybe theget_tagged_response
inidle
needs to have a timeout as well?Add liveness health check support
Set timeout for IDLE responses
shugo commentedon Mar 16, 2021
@stanhu Thanks for your information.
Could you try #15?
It adds a new option idle_response_timeout, whose default value is 5 seconds.
Add liveness health check support
Add liveness health check support