Posted to issues@trafficserver.apache.org by "Gancho Tenev (JIRA)" <ji...@apache.org> on 2016/10/11 15:02:20 UTC

[jira] [Comment Edited] (TS-4916) Http2ConnectionState::restart_streams infinite loop causes deadlock

    [ https://issues.apache.org/jira/browse/TS-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565557#comment-15565557 ] 

Gancho Tenev edited comment on TS-4916 at 10/11/16 3:02 PM:
------------------------------------------------------------

[~shinrich], appreciate your comment!

I am running with version 6.2.1 and not with master.

{{DLL<>}} not being thread-safe while multiple threads manipulate it concurrently was the first thing that came to mind. I changed the code so that the {{client_streams_count}} increments/decrements and the {{stream_list.push()}} / {{stream_list.remove()}} calls happen only from two new corresponding functions, {{add_to_active_streams()}} and {{rm_from_active_streams()}} (in {{Http2ConnectionState}}), and added an assert like the following:

{code}
(1) ink_assert(this->mutex->thread_holding == this_ethread());
{code}

which never triggered in my case (9+ days already)! 

In fact, traffic_server gets into the same broken state described in my previous post (a broken DLL structure with a few elements, where the last element points to itself) reliably and consistently on all the machines I inspected (10+ at this time), and I saw the same symptoms (cache-update failures skyrocketing after the H2 infinite loop happens) on many more machines.

Then I came up with the hypothesis described earlier and added the following assert to the {{add_to_active_streams()}} function:

{code}
(2) ink_assert(!stream_list.in(stream));
{code}

which started triggering quickly (within < 1 day) on all machines I tested. It was always the case that {{stream_list.head == new_stream}} (a memory chunk with the same address had already been added to {{stream_list}}).

Then I changed {{Http2Stream::delete_stream()}} so it can be called safely even if the stream was already deleted, and added a “catch-all” call in {{Http2Stream::destroy()}} to make sure the stream is deleted before being destroyed, and the problem never happened again ((1) and (2) did not trigger for at least 3+ days).

Looking into it more, it turned out that if {{Http2Stream::do_io_close()}} is entered with {{_state == HTTP2_STREAM_STATE_HALF_CLOSED_REMOTE}} (in version 6.2.1), it fails to delete the stream before it orders the stream's self-destruction (a few lines below). When execution reaches {{Http2Stream::destroy()}}, the memory is freed without the stream ever being removed from the active list, and then the H2 infinite loop occurs if the sequence of events described in my previous post happens.

Making sure {{Http2Stream::do_io_close()}} deletes the stream before requesting the self-destruction (with {{VC_EVENT_EOS}}) fixes the problem (it can be solved in many different ways). 

Although I can see that there might be a race condition in that part of the code, I have simply never run into it ((1) never triggered, and all experiments fit pretty well what I described in my previous post).

After your comment I identified 4 places where we might not be holding the lock properly when modifying {{stream_list}}/{{client_streams_count}} (not actually happening in practice, found just by reading/tracing the code), added the missing locking, repeated the experiments, and got the same results (including the failing assert (2) and the fix).

I see that the master code has changed quite a lot since 6.2.1; I checked with [~zwoop] and will probably prepare a PR for a backport to 6.2.x.

Since the master code changed (in particular {{Http2Stream::do_io_close()}}), there is a chance we no longer run into this H2-infinite-loop condition (I have not tested it yet). However, we still use {{DLL<>}} and a memory pool in the same way, so it seems possible to hit the same problem if we are not careful; I will study the new master code more to see if we can do something to avoid it.

Cheers,
—Gancho




> Http2ConnectionState::restart_streams infinite loop causes deadlock 
> --------------------------------------------------------------------
>
>                 Key: TS-4916
>                 URL: https://issues.apache.org/jira/browse/TS-4916
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core, HTTP/2
>            Reporter: Gancho Tenev
>            Assignee: Gancho Tenev
>            Priority: Blocker
>             Fix For: 7.1.0
>
>
> Http2ConnectionState::restart_streams falls into an infinite loop while holding a lock, which leads to cache updates to start failing.
> The infinite loop is caused by traversing a list whose last element's "next" pointer points to the element itself, so the traversal never finishes.
> {code}
> Thread 51 (Thread 0x2aaab3d04700 (LWP 34270)):
> #0  0x00002aaaaacf3fee in Http2ConnectionState::restart_streams (this=0x2ae6ba5284c8) at Http2ConnectionState.cc:913
> #1  rcv_window_update_frame (cstate=..., frame=...) at Http2ConnectionState.cc:627
> #2  0x00002aaaaacf9738 in Http2ConnectionState::main_event_handler (this=0x2ae6ba5284c8, event=<optimized out>, edata=<optimized out>) at Http2ConnectionState.cc:823
> #3  0x00002aaaaacef1c3 in Continuation::handleEvent (data=0x2aaab3d039a0, event=2253, this=0x2ae6ba5284c8) at ../../iocore/eventsystem/I_Continuation.h:153
> #4  send_connection_event (cont=cont@entry=0x2ae6ba5284c8, event=event@entry=2253, edata=edata@entry=0x2aaab3d039a0) at Http2ClientSession.cc:58
> #5  0x00002aaaaacef462 in Http2ClientSession::state_complete_frame_read (this=0x2ae6ba528290, event=<optimized out>, edata=0x2aab7b237f18) at Http2ClientSession.cc:426
> #6  0x00002aaaaacf0982 in Continuation::handleEvent (data=0x2aab7b237f18, event=100, this=0x2ae6ba528290) at ../../iocore/eventsystem/I_Continuation.h:153
> #7  Http2ClientSession::state_start_frame_read (this=0x2ae6ba528290, event=<optimized out>, edata=0x2aab7b237f18) at Http2ClientSession.cc:399
> #8  0x00002aaaaacef5a3 in Continuation::handleEvent (data=0x2aab7b237f18, event=100, this=0x2ae6ba528290) at ../../iocore/eventsystem/I_Continuation.h:153
> #9  Http2ClientSession::state_complete_frame_read (this=0x2ae6ba528290, event=<optimized out>, edata=0x2aab7b237f18) at Http2ClientSession.cc:431
> #10 0x00002aaaaacf0982 in Continuation::handleEvent (data=0x2aab7b237f18, event=100, this=0x2ae6ba528290) at ../../iocore/eventsystem/I_Continuation.h:153
> #11 Http2ClientSession::state_start_frame_read (this=0x2ae6ba528290, event=<optimized out>, edata=0x2aab7b237f18) at Http2ClientSession.cc:399
> #12 0x00002aaaaae67e2b in Continuation::handleEvent (data=0x2aab7b237f18, event=100, this=<optimized out>) at ../../iocore/eventsystem/I_Continuation.h:153
> #13 read_signal_and_update (vc=0x2aab7b237e00, vc@entry=0x1, event=event@entry=100) at UnixNetVConnection.cc:153
> #14 UnixNetVConnection::readSignalAndUpdate (this=this@entry=0x2aab7b237e00, event=event@entry=100) at UnixNetVConnection.cc:1036
> #15 0x00002aaaaae47653 in SSLNetVConnection::net_read_io (this=0x2aab7b237e00, nh=0x2aaab2409cc0, lthread=0x2aaab2406000) at SSLNetVConnection.cc:595
> #16 0x00002aaaaae5558c in NetHandler::mainNetEvent (this=0x2aaab2409cc0, event=<optimized out>, e=<optimized out>) at UnixNet.cc:513
> #17 0x00002aaaaae8d2e6 in Continuation::handleEvent (data=0x2aaab0bfa700, event=5, this=<optimized out>) at I_Continuation.h:153
> #18 EThread::process_event (calling_code=5, e=0x2aaab0bfa700, this=0x2aaab2406000) at UnixEThread.cc:148
> #19 EThread::execute (this=0x2aaab2406000) at UnixEThread.cc:275
> #20 0x00002aaaaae8c0e6 in spawn_thread_internal (a=0x2aaab0b25bb0) at Thread.cc:86
> #21 0x00002aaaad6b3aa1 in start_thread (arg=0x2aaab3d04700) at pthread_create.c:301
> #22 0x00002aaaae8bc93d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
> {code}
> Here is the stream_list trace.
> {code}
> (gdb) thread 51
> [Switching to thread 51 (Thread 0x2aaab3d04700 (LWP 34270))]
> #0  0x00002aaaaacf3fee in Http2ConnectionState::restart_streams (this=0x2ae6ba5284c8) at Http2ConnectionState.cc:913
> (gdb) trace_list stream_list
> ------- count=0 -------
> id=29
> this=0x2ae673f0c840
> next=0x2aaac05d8900
> prev=(nil)
> ------- count=1 -------
> id=27
> this=0x2aaac05d8900
> next=0x2ae5b6bbec00
> prev=0x2ae673f0c840
> ------- count=2 -------
> id=19
> this=0x2ae5b6bbec00
> next=0x2ae5b6bbec00
> prev=0x2aaac05d8900
> ------- count=3 -------
> id=19
> this=0x2ae5b6bbec00
> next=0x2ae5b6bbec00
> prev=0x2aaac05d8900
> . . . 
> ------- count=5560 -------
> id=19
> this=0x2ae5b6bbec00
> next=0x2ae5b6bbec00
> prev=0x2aaac05d8900
> . . .
> {code}
> Currently I am working on finding out why the list in question got into this "impossible" (broken) state and eventually coming up with a fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)