Posted to issues@trafficserver.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/08/05 21:29:20 UTC

[jira] [Work logged] (TS-4717) Http2 stack explosion

     [ https://issues.apache.org/jira/browse/TS-4717?focusedWorklogId=26209&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-26209 ]

ASF GitHub Bot logged work on TS-4717:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 05/Aug/16 21:28
            Start Date: 05/Aug/16 21:28
    Worklog Time Spent: 10m 
      Work Description: GitHub user shinrich opened a pull request:

    https://github.com/apache/trafficserver/pull/842

    TS-4717:  Http2 stack explosion.

    Added a common state_process_frame_read method to loop over reading frames while there is data available.  The original state_start_frame_read and state_complete_frame_read call into state_process_frame_read so the event handling cases still work.  
    
    I have been running a version of this code on two of our production boxes for a day.  We haven't had a load surge event, so I doubt we have seen a case that would have caused the stack explosion.  But the performance and error stats seem similar to their peers, so I don't think I have messed up the normal operating case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shinrich/trafficserver ts-4717

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/trafficserver/pull/842.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #842
    
----
commit a166cf0335672abd3514f43a081b4fce045725f2
Author: Susan Hinrichs <sh...@ieee.org>
Date:   2016-08-05T14:29:53Z

    TS-4717:  Http2 stack explosion.

----


Issue Time Tracking
-------------------

            Worklog Id:     (was: 26209)
            Time Spent: 10m
    Remaining Estimate: 0h

> Http2 stack explosion
> ---------------------
>
>                 Key: TS-4717
>                 URL: https://issues.apache.org/jira/browse/TS-4717
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: HTTP/2
>            Reporter: Susan Hinrichs
>            Assignee: Susan Hinrichs
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We see this periodically with high traffic loads.  ATS crashes with 7000+ frames on the stack.  The bulk of the frames are the following frame sequence.  
> {code}
> #117 0x00000000005159c8 in Continuation::handleEvent (this=0x2b0bdd101b90, event=100, data=0x2b0bad0c7cf0)
>     at ../iocore/eventsystem/I_Continuation.h:150
> #118 0x000000000064c05d in Http2ClientSession::state_start_frame_read (this=0x2b0bdd101b90, event=100, edata=0x2b0bad0c7cf0)
>     at Http2ClientSession.cc:451
> #119 0x000000000064b0af in Http2ClientSession::main_event_handler (this=0x2b0bdd101b90, event=100, edata=0x2b0bad0c7cf0) at Http2ClientSession.cc:292
> #120 0x00000000005159c8 in Continuation::handleEvent (this=0x2b0bdd101b90, event=100, data=0x2b0bad0c7cf0)
>     at ../iocore/eventsystem/I_Continuation.h:150
> #121 0x000000000064c386 in Http2ClientSession::state_complete_frame_read (this=0x2b0bdd101b90, event=100, edata=0x2b0bad0c7cf0)
>     at Http2ClientSession.cc:483
> #122 0x000000000064b0af in Http2ClientSession::main_event_handler (this=0x2b0bdd101b90, event=100, edata=0x2b0bad0c7cf0) at Http2ClientSession.cc:292
> #123 0x00000000005159c8 in Continuation::handleEvent (this=0x2b0bdd101b90, event=100, data=0x2b0bad0c7cf0)
>     at ../iocore/eventsystem/I_Continuation.h:150
> #124 0x000000000064c05d in Http2ClientSession::state_start_frame_read (this=0x2b0bdd101b90, event=100, edata=0x2b0bad0c7cf0)
>     at Http2ClientSession.cc:451
> {code}
> We had cherry picked in the fix for TS-4209 to correctly enforce the concurrent stream limit.  But in the latest crash of this type, it looks like we are pulling small items from cache, so the stream lives and dies on the stack.  The concurrent active connection count never reaches the limit.
> I am going to try to change the state_start_frame_read/state_complete_frame_read logic from recursing handlers to a loop.
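
To illustrate the difference between the recursing handlers and a loop, here is a minimal, self-contained sketch. It is not the actual Traffic Server code: MockSession, its counters, and every method name in it are hypothetical stand-ins, and the three-frame cycle seen in the backtrace (handleEvent, state_*_frame_read, main_event_handler) is collapsed into direct calls. It only shows how nesting depth grows with the number of buffered frames in the recursive style and stays constant in the loop style.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for a session object; NOT the real Http2ClientSession.
struct MockSession {
  std::size_t frames_available;     // complete frames already buffered
  std::size_t frames_processed = 0; // frames consumed so far
  std::size_t depth = 0;            // current handler nesting depth
  std::size_t max_depth = 0;        // deepest nesting observed

  explicit MockSession(std::size_t n) : frames_available(n) {}

  void enter() { ++depth; max_depth = std::max(max_depth, depth); }
  void leave() { --depth; }

  // Recursive style: each handler calls the other while data remains,
  // so every buffered frame adds two more calls to the stack.
  void recursive_start_frame_read() {
    enter();
    if (frames_available > 0) {
      recursive_complete_frame_read();
    }
    leave();
  }

  void recursive_complete_frame_read() {
    enter();
    --frames_available;
    ++frames_processed;
    if (frames_available > 0) {
      recursive_start_frame_read(); // re-enter instead of returning
    }
    leave();
  }

  // Loop style: one common method drains all available frames,
  // so nesting depth does not depend on the number of frames.
  void loop_process_frame_read() {
    enter();
    while (frames_available > 0) {
      --frames_available;
      ++frames_processed;
    }
    leave();
  }

  // The original entry point delegates to the common loop, so callers
  // that dispatch to it (the event handling path) keep working.
  void loop_start_frame_read() {
    enter();
    loop_process_frame_read();
    leave();
  }
};
```

With 1000 buffered frames, the recursive entry point in this sketch nests 2000 calls deep, while the loop entry point never nests deeper than 2, regardless of how many frames are waiting.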



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)