You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@trafficserver.apache.org by shinrich <gi...@git.apache.org> on 2017/02/21 16:02:46 UTC

[GitHub] trafficserver issue #1476: Crash in get_client_addr()

GitHub user shinrich opened an issue:

    https://github.com/apache/trafficserver/issues/1476

    Crash in get_client_addr()

    After working through issues #1401 and #1443, we now see the following crash in our copy of 7.1.  The crash occurs once every couple hours.
    
    ```
    #0  0x00002b46debb3867 in ?? () from /lib64/libgcc_s.so.1
    #1  0x00002b46debb4119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
    #2  0x00002b46deeb9936 in backtrace () from /lib64/libc.so.6
    #3  0x00002b46dc70fe72 in ink_stack_trace_dump () at ink_stack_trace.cc:61
    #4  0x00002b46dc71213a in signal_crash_handler (signo=11) at signals.cc:186
    #5  0x0000000000513414 in crash_logger_invoke (signo=11, info=0x2b46e39dc130, ctx=0x2b46e39dc000) at Crash.cc:169
    #6  <signal handler called>
    #7  0x0000000000000001 in ?? () at ../../lib/ts/Diags.h:142
    #8  0x00000000005126d5 in NetVConnection::get_remote_addr (this=0x2aad1cc4d0a0)
        at /home/shinrich/yats_build/trafficserver/iocore/net/P_NetVConnection.h:30
    #9  0x000000000055deea in ProxyClientSession::get_client_addr (this=0x2ab0dc25bba0) at ProxyClientSession.h:221
    #10 0x00000000005352bf in TSHttpSsnClientAddrGet (ssnp=0x2ab0dc25bba0) at InkAPI.cc:5439
    #11 0x000000000053530d in TSHttpTxnClientAddrGet (txnp=0x2b474440c9f0) at InkAPI.cc:5447
    #12 0x00002aaaae4008f6 in http_hook (contp=0x1e42f40, event=60006, edata=0x2b474440c9f0) at INKPluginInit.cc:174
    #13 0x000000000052a5fd in INKContInternal::handle_event (this=0x1e42f40, event=60006, edata=0x2b474440c9f0) at InkAPI.cc:1048
    #14 0x00000000005160f2 in Continuation::handleEvent (this=0x1e42f40, event=60006, data=0x2b474440c9f0)
        at /home/shinrich/yats_build/trafficserver/iocore/eventsystem/I_Continuation.h:153
    #15 0x000000000052adf3 in APIHook::invoke (this=0x1e44f40, event=60006, edata=0x2b474440c9f0) at InkAPI.cc:1267
    #16 0x00000000005e5e16 in HttpSM::state_api_callout (this=0x2b474440c9f0, event=0, data=0x0) at HttpSM.cc:1462
    #17 0x00000000005f3a11 in HttpSM::do_api_callout_internal (this=0x2b474440c9f0) at HttpSM.cc:5171
    #18 0x0000000000601b85 in HttpSM::do_api_callout (this=0x2b474440c9f0) at HttpSM.cc:438
    #19 0x00000000005e78d6 in HttpSM::state_read_server_response_header (this=0x2b474440c9f0, event=100, data=0x2b4744cac508)
        at HttpSM.cc:1962
    #20 0x00000000005ea331 in HttpSM::main_handler (this=0x2b474440c9f0, event=100, data=0x2b4744cac508) at HttpSM.cc:2662
    #21 0x00000000005160f2 in Continuation::handleEvent (this=0x2b474440c9f0, event=100, data=0x2b4744cac508)
        at /home/shinrich/yats_build/trafficserver/iocore/eventsystem/I_Continuation.h:153
    #22 0x000000000078489f in read_signal_and_update (event=100, vc=0x2b4744cac3e0) at UnixNetVConnection.cc:145
    #23 0x0000000000787a10 in UnixNetVConnection::readSignalAndUpdate (this=0x2b4744cac3e0, event=100) at UnixNetVConnection.cc:1125
    #24 0x000000000076a098 in SSLNetVConnection::net_read_io (this=0x2b4744cac3e0, nh=0x2b46e18ade60, lthread=0x2b46e18aa010)
        at SSLNetVConnection.cc:587
    #25 0x000000000077d20d in NetHandler::mainNetEvent (this=0x2b46e18ade60, event=5, e=0x1873f20) at UnixNet.cc:509
    #26 0x00000000005160f2 in Continuation::handleEvent (this=0x2b46e18ade60, event=5, data=0x1873f20)
        at /home/shinrich/yats_build/trafficserver/iocore/eventsystem/I_Continuation.h:153
    #27 0x00000000007a69c5 in EThread::process_event (this=0x2b46e18aa010, e=0x1873f20, calling_code=5) at UnixEThread.cc:143
    #28 0x00000000007a6eb1 in EThread::execute (this=0x2b46e18aa010) at UnixEThread.cc:270
    #29 0x00000000007a6089 in spawn_thread_internal (a=0x1762fe0) at Thread.cc:84
    #30 0x00002b46df156aa1 in start_thread () from /lib64/libpthread.so.0
    #31 0x00002b46deea393d in clone () from /lib64/libc.so.6
    ```
    I assume that there has been a goodly amount of inlining here so it is hard to tell how we got from NetVConnection::get_remote_addr to the Diags::on method.  Digging through the core, the cached member addresses seem ok, and I don't see evidence of thread races.
    
    I think I have seen this stack reported before, but couldn't find the issue.

----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] trafficserver issue #1476: Crash in get_client_addr()

Posted by shinrich <gi...@git.apache.org>.
Github user shinrich commented on the issue:

    https://github.com/apache/trafficserver/issues/1476
  
    Very odd bug.  Most of the crashes are in get_client_addr or get_remote_addr.  In both cases, the problem is at the point of the virtual method call into the corresponding test.  The RAX register is either 0 or 1.  But when you go through the assembly instructions based on the current register values, the function pointer should be valid.
    
    Tried this on two different machines in case it was a HW problem.  Still see the problem on both machine.   Set counters to see if the crash happens when calling into the set method or on return.  The crash happens before the set method is called.
    
    At this point, I think it is a use-after-free problem.  If I do the function calculation off a VC object (which would be the pointer in the virtual function table after free), the value is a 0 or 1.
    
    I've tried to set up ASAN builds but they are not working for me at the moment (issue filed).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] trafficserver issue #1476: Crash in get_client_addr()

Posted by suenway <gi...@git.apache.org>.
Github user suenway commented on the issue:

    https://github.com/apache/trafficserver/issues/1476
  
    
    This is an automatically generated message.
    
    sunwei@yahoo-inc.com is no longer with Yahoo! Inc.
    
    Your message will not be forwarded.
    
    If you require assistance with a legal matter, please send a message to legal-notices@yahoo-inc.com
    
    Thank you!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] trafficserver issue #1476: Crash in get_client_addr()

Posted by shinrich <gi...@git.apache.org>.
Github user shinrich commented on the issue:

    https://github.com/apache/trafficserver/issues/1476
  
    Have been adding member variables to NetVConnection and adding assert to HttpSM::state_api_callout.
    
    In the cores it seemed that the crash always happened in the first hook called from HttpSM::state_read_server_response_header.  As noted in the previous comment, the event came in on the server side VC, and the state_read_server_response_header is getting ready to send the response of the UA side VC.   I added allocate thread, free_thread, and last_alloc time to the netvc.  And added the following assert at the beginning of NetVConnection 
    
    {code}
    ink_release_assert(milestones[TS_MILESTONE_UA_CLOSE] > 0 || ua_session->get_netvc() == NULL ||
                (ua_session->get_create_time() != 0 && ink_hrtime_to_msec(ua_session->get_create_time()) <= ink_hrtime_to_msec(milestones[TS_MILESTONE_UA_BEGIN])));
    {code}
    
    This will trigger if there is not last_alloc_time set (it has been freed) or the create time of the VC is newer than the start of the transaction (has been reallocated).  This assert would trigger before we called the hooks implying that the client_vc has been freed in a previous stack but the ua_session reference has not been cleaned up.  The only place I could see this happening is in read_signal_and_update and write_signal_and_update of the corresponding vio._cont is NULL.  I added warnings in the NULL path, and they triggered a handful of times with event=EVENT_ERROR (3).  
    
    The error bubble addition (which we were concerned about from issue #1401) adds calls to read_signal_error and write_signal_error.  So I reverted that commit (c1ac5f8bf87fd4bc3a8e06507219970d83965acd) and removed my workarounds for issue #1401.  I let it run overnight and it crashed twice on my assert, but in the accept stack not the send response stack.  The timestamps varied in the low microseconds, so I am doubting the accuracy of our timestamping at the micro and nano seconds.  I added the ink_hrtime_to_msec to the asserts, and kicked it off to run today.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] trafficserver issue #1476: Crash in get_client_addr()

Posted by shinrich <gi...@git.apache.org>.
Github user shinrich commented on the issue:

    https://github.com/apache/trafficserver/issues/1476
  
    Still struggling with this issue.  Running into it every 2-4 hours in production.  At this point, I think it is a use-after-free, but as noted avoid ASAN fails in this case.  
    
    Looking at a core more closely, it seems to be an incoming client VC that is being messed up.  In the core, the remote_addr of the VC did not match the src_addr in the HttpSM.  Both IP addresses show up in our squid.log at roughly the same time.
    
    We are running with accept threads.  I'm going to turn that off on my test machine,  I assume that will eliminate this race condition.  If so, then will need to press on and see how we are freeing the VC early.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] trafficserver issue #1476: Crash in get_client_addr()

Posted by shinrich <gi...@git.apache.org>.
Github user shinrich commented on the issue:

    https://github.com/apache/trafficserver/issues/1476
  
    I had two more asserts once I added back in the accept thread.  Upon deeper inspection, the last alloc time and the UA_BEGIN time were off by half a ms, so I am attributing that to clock skew between threads.  I think the best action for 7.1.0 is to revert commit c1ac5f8bf87fd4bc3a8e06507219970d83965acd and investigate how to safely bubble up errors for 7.1.1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---