Posted to issues@trafficserver.apache.org by GitBox <gi...@apache.org> on 2020/10/08 16:10:22 UTC

[GitHub] [trafficserver] calavera opened a new issue #7255: Deadlock in UnixNetVConnection with 9.0.x and http 1

calavera opened a new issue #7255:
URL: https://github.com/apache/trafficserver/issues/7255


   There has been some investigation into this over the years, but it was never really resolved. @oknet put asserts in the code that crash when the keep-alive queue is touched without holding the NetHandler's lock, and we ran into one of them this week.
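   
   For context, the guard that fires appears to be a try-lock on the NetHandler's mutex followed by a release assert when the lock is not already held. The sketch below is a paraphrase reconstructed from the assertion text in the trace, not verbatim UnixNetVConnection.cc source:
   
   ```
   // Paraphrased guard in UnixNetVConnection::add_to_keep_alive_queue()
   // (reconstructed from the assertion message below; not verbatim source).
   void
   UnixNetVConnection::add_to_keep_alive_queue()
   {
     MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
     if (lock.is_locked()) {
       // Lock already held (or grabbed on this thread): safe to touch the queue.
       nh->add_to_keep_alive_queue(this);
     } else {
       ink_release_assert(!"BUG: It must have acquired the NetHandler's lock before doing anything on keep_alive_queue.");
     }
   }
   ```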
   
   It's hard for me to give a working reproduction because this is tightly coupled with a lot of other code, but I'll explain what the code does in the hope that someone has an idea of how to solve it.
   
   1. We have a plugin that registers session hooks for TS_EVENT_HTTP_SSN_START and TS_EVENT_HTTP_SSN_CLOSE (a minimal sketch follows this list).
   2. On SSN_START, the plugin sends a request to `https://localhost`, which is the TrafficServer service itself, which then proxies to a Google Cloud Storage (GCS) bucket. We do this because we want to cache the GCS response on the same node that sends the request.
   3. On SSN_CLOSE, we do two things: remove an object from the session with `TSHttpSsnArgSet`, and then reenable the session with `TSHttpSsnReenable(session, TS_EVENT_HTTP_CONTINUE);`. The crash happens when the session releases resources after it's reenabled.
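   
   The sketch below shows the general shape of those hooks. It is a minimal sketch, not our actual plugin; the plugin name, the session-arg index handling, and the handler name are placeholders:
   
   ```
   // Minimal sketch of the SSN_START / SSN_CLOSE hook pattern described above.
   #include <ts/ts.h>

   static int g_ssn_arg_idx = 0; // placeholder; assumed reserved during plugin init

   static int
   ssn_hook_handler(TSCont /* contp */, TSEvent event, void *edata)
   {
     TSHttpSsn ssnp = static_cast<TSHttpSsn>(edata);

     switch (event) {
     case TS_EVENT_HTTP_SSN_START:
       // The real plugin kicks off an async request to https://localhost here
       // and reenables the session when it completes, possibly from a
       // non-ET_NET thread.
       TSHttpSsnReenable(ssnp, TS_EVENT_HTTP_CONTINUE);
       break;
     case TS_EVENT_HTTP_SSN_CLOSE:
       // Drop the per-session object, then reenable; the crash happens while
       // the session releases resources after this reenable.
       TSHttpSsnArgSet(ssnp, g_ssn_arg_idx, nullptr);
       TSHttpSsnReenable(ssnp, TS_EVENT_HTTP_CONTINUE);
       break;
     default:
       break;
     }
     return 0;
   }

   void
   TSPluginInit(int /* argc */, const char ** /* argv */)
   {
     TSPluginRegistrationInfo info;
     info.plugin_name   = "ssn-hooks-sketch"; // placeholder name
     info.vendor_name   = "example";
     info.support_email = "dev@example.com";
     TSPluginRegister(&info);

     TSCont contp = TSContCreate(ssn_hook_handler, nullptr);
     TSHttpHookAdd(TS_HTTP_SSN_START_HOOK, contp);
     TSHttpHookAdd(TS_HTTP_SSN_CLOSE_HOOK, contp);
   }
   ```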
   
   As I mentioned in the title, this only happens with HTTP/1 because HTTP/2 doesn't have the same keep-alive logic. It's very easy to reproduce with `curl --http1.1` once "you have the whole system working" :laughing: 
   
   ```
   [Oct  3 19:19:35.072] [ET_NET 11] DIAG: (netlify_resolve_async) Resolving bundle...                          host=traffic-mesh-test.netlify.com
   Fatal: UnixNetVConnection.cc:1449: failed assertion `!"BUG: It must have acquired the NetHandler's lock before doing anything on keep_alive_queue."`
   traffic_server: received signal 6 (Aborted)
   traffic_server - STACK TRACE: 
   bin/traffic_server(_Z19crash_logger_invokeiP9siginfo_tPv+0xa8)[0x555d66264578]
   /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f4e5f2783c0]
   /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f4e5ed6418b]
   /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f4e5ed43859]
   /opt/ts/lib/libtscore.so.9(+0x60cc1)[0x7f4e5f77fcc1]
   /opt/ts/lib/libtscore.so.9(+0x5e01b)[0x7f4e5f77d01b]
   bin/traffic_server(_ZN18UnixNetVConnection23add_to_keep_alive_queueEv+0x125)[0x555d66506525]
   bin/traffic_server(_ZN18Http1ClientSession7releaseEP16ProxyTransaction+0x144)[0x555d662c91a4]
   bin/traffic_server(_ZN12ProxySession17handle_api_returnEi+0xbe)[0x555d664b0d9e]
   bin/traffic_server(_ZN12ProxySession17state_api_calloutEiPv+0x52)[0x555d664b0e12]
   bin/traffic_server(_ZN17TSHttpSsnCallback13event_handlerEiPv+0x6e)[0x555d66293c7e]
   bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x34d)[0x555d6654f21d]
   bin/traffic_server(_ZN7EThread13process_queueEP5QueueI5EventNS1_9Link_linkEEPiS5_+0x24e)[0x555d6654f94e]
   bin/traffic_server(_ZN7EThread15execute_regularEv+0x186)[0x555d6654fdf6]
   bin/traffic_server(_ZN7EThread7executeEv+0x1ee)[0x555d6655054e]
   bin/traffic_server(+0x3e88a9)[0x555d6654e8a9]
   /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f4e5f26c609]
   /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f4e5ee40293]
   ```





[GitHub] [trafficserver] shinrich commented on issue #7255: Deadlock in UnixNetVConnection with 9.0.x and http 1

Posted by GitBox <gi...@apache.org>.
shinrich commented on issue #7255:
URL: https://github.com/apache/trafficserver/issues/7255#issuecomment-705737284


   Does this happen fairly reliably when an SSN_CLOSE hook is used? What is the event value being passed up the call stack? Generally this is a net I/O event, so the NetHandler is already locked. However, in this case the item at the bottom of the stack does not appear to be a net I/O event; it's probably a reschedule (event 1 or 2).





[GitHub] [trafficserver] calavera commented on issue #7255: Deadlock in UnixNetVConnection with 9.0.x and http 1

Posted by GitBox <gi...@apache.org>.
calavera commented on issue #7255:
URL: https://github.com/apache/trafficserver/issues/7255#issuecomment-705675372


   It might also be worth mentioning that we don't use the Fetch APIs that TrafficServer provides to send that request, because we want to use an HTTP/2 client for better performance when fetching those objects from GCS, so the call is not considered internal from TrafficServer's point of view.





[GitHub] [trafficserver] shinrich closed issue #7255: Deadlock in UnixNetVConnection with 9.0.x and http 1

Posted by GitBox <gi...@apache.org>.
shinrich closed issue #7255:
URL: https://github.com/apache/trafficserver/issues/7255


   





[GitHub] [trafficserver] shinrich commented on issue #7255: Deadlock in UnixNetVConnection with 9.0.x and http 1

Posted by GitBox <gi...@apache.org>.
shinrich commented on issue #7255:
URL: https://github.com/apache/trafficserver/issues/7255#issuecomment-715443661


   I have a solution. I will put up a PR shortly for you to try, @calavera. I want to add an autest as well to catch the regression.





[GitHub] [trafficserver] shinrich commented on issue #7255: Deadlock in UnixNetVConnection with 9.0.x and http 1

Posted by GitBox <gi...@apache.org>.
shinrich commented on issue #7255:
URL: https://github.com/apache/trafficserver/issues/7255#issuecomment-715381499


   So TSHttpSsnReenable will do a schedule_imm (which results in event 1) with a session callback as the target continuation if it fails to get the session mutex, or if the reenable is called from a non-ET_NET thread. I'm guessing the latter is what your plugin is triggering; I doubt there would be lock contention on the session.
   
   Since that path does not come up through the net I/O event handling, the NetHandler lock is not held before add_to_keep_alive_queue() is called.
   
   Let me ponder how/where to add the NetHandler lock in this case.
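   
   Roughly, the reenable path I'm describing looks like the sketch below. It is a paraphrase of the logic, not the actual InkAPI.cc source; the TSHttpSsnCallback name comes from the stack trace above, and the constructor arguments here are illustrative:
   
   ```
   // Paraphrased logic of TSHttpSsnReenable (not verbatim InkAPI.cc).
   // If we are not on a regular ET_NET thread, or we cannot grab the session
   // mutex, the reenable is rescheduled as an EVENT_IMMEDIATE (event 1) to a
   // TSHttpSsnCallback continuation instead of being handled inline -- and
   // that rescheduled path does not hold the NetHandler lock.
   void
   TSHttpSsnReenable(TSHttpSsn ssnp, TSEvent event)
   {
     ProxySession *cs = reinterpret_cast<ProxySession *>(ssnp);
     EThread *eth     = this_ethread();

     if (eth->tt != REGULAR) {
       // Called from a non-ET_NET thread: bounce to an ET_NET thread.
       eventProcessor.schedule_imm(new TSHttpSsnCallback(cs, event), ET_NET);
     } else {
       MUTEX_TRY_LOCK(trylock, cs->mutex, eth);
       if (!trylock.is_locked()) {
         // Session mutex contended: reschedule on ET_NET as well.
         eventProcessor.schedule_imm(new TSHttpSsnCallback(cs, event), ET_NET);
       } else {
         // Only this inline path can be reached with the NetHandler lock held.
         cs->handleEvent(static_cast<int>(event), nullptr);
       }
     }
   }
   ```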





[GitHub] [trafficserver] calavera commented on issue #7255: Deadlock in UnixNetVConnection with 9.0.x and http 1

Posted by GitBox <gi...@apache.org>.
calavera commented on issue #7255:
URL: https://github.com/apache/trafficserver/issues/7255#issuecomment-706818279


   I was wrong; it looks like it happens during SSN_START, after the hook is triggered. The event is TS_EVENT_HTTP_CONTINUE, triggered by TSHttpSsnReenable.
   
   I changed our plugin so that this HTTP request doesn't go through TrafficServer, going directly to the origin and not caching anything. It still fails in the same way, so that's not what's causing the problem.
   
   I got this stack trace in gdb:
   
   ```
   (gdb) backtrace
   #0  0x00007ffff752318b in raise () from /lib/x86_64-linux-gnu/libc.so.6
   #1  0x00007ffff7502859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
   #2  0x00007ffff7f41a5a in ink_abort (message_format=message_format@entry=0x7ffff7fa3b6f "%s:%d: failed assertion `%s`") at ink_error.cc:99
   #3  0x00007ffff7f3ef24 in _ink_assert (expression=expression@entry=0x5555559d7110 "!\"BUG: It must have acquired the NetHandler's lock before doing anything on keep_alive_queue.\"", 
       file=file@entry=0x5555559d6b63 "UnixNetVConnection.cc", line=line@entry=1449) at ink_assert.cc:37
   #4  0x000055555593d321 in UnixNetVConnection::add_to_keep_alive_queue (this=0x7fff70027b20) at UnixNetVConnection.cc:1449
   #5  0x00005555556b70a9 in Http1ClientSession::release (this=0x7fffe81f5bc0, trans=0x7fffe81f5f38) at Http1ClientSession.cc:419
   #6  0x00005555556b56ed in Http1ClientSession::start (this=<optimized out>) at Http1ClientSession.cc:509
   #7  0x00005555558e03e9 in ProxySession::handle_api_return (this=this@entry=0x7fffe81f5bc0, event=event@entry=60000) at ProxySession.cc:166
   #8  0x00005555558e0b81 in ProxySession::state_api_callout (this=0x7fffe81f5bc0, event=60000, data=<optimized out>) at ProxySession.cc:121
   #9  0x0000555555680dcb in Continuation::handleEvent (data=0x0, event=60000, this=0x7fffe81f5bc0) at ./iocore/eventsystem/I_Continuation.h:190
   #10 TSHttpSsnCallback::event_handler (this=0x7fff7c002260) at traffic_server/InkAPI.cc:4906
   #11 0x0000555555978d4a in Continuation::handleEvent (data=0x555555cb7f80, event=1, this=0x7fff7c002260) at I_Continuation.h:190
   #12 EThread::process_event (this=this@entry=0x7ffff66ab010, e=e@entry=0x555555cb7f80, calling_code=1) at UnixEThread.cc:127
   #13 0x00005555559793de in EThread::process_queue (this=this@entry=0x7ffff66ab010, NegativeQueue=NegativeQueue@entry=0x7ffff5fdcde0, ev_count=ev_count@entry=0x7ffff5fdcdd4, 
       nq_count=nq_count@entry=0x7ffff5fdcdd0) at UnixEThread.cc:162
   #14 0x00005555559798a8 in EThread::execute_regular (this=this@entry=0x7ffff66ab010) at UnixEThread.cc:222
   #15 0x000055555597a22f in EThread::execute (this=0x7ffff66ab010) at UnixEThread.cc:327
   #16 0x00005555559782bb in spawn_thread_internal (a=0x555555caf890) at Thread.cc:92
   #17 0x00007ffff7a2b609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
   #18 0x00007ffff75ff293 in clone () from /lib/x86_64-linux-gnu/libc.so.6
   ```
   
   These are some more logs that I captured with my debug version:
   
   ```
   [Oct 12 02:11:32.972] [ET_NET 1] DEBUG: <SSLNetVConnection.cc:1288 (sslServerHandShakeEvent)> (ssl) SSL server handshake completed successfully
   [Oct 12 02:11:32.972] [ET_NET 1] DEBUG: <SSLNetVConnection.cc:1302 (sslServerHandShakeEvent)> (ssl) ssl handshake time:130845435
   [Oct 12 02:11:32.972] [ET_NET 1] DEBUG: <SSLNetVConnection.cc:1325 (sslServerHandShakeEvent)> (ssl) client selected next protocol 'http/1.1'
   [Oct 12 02:11:32.972] [ET_NET 1] DEBUG: <SSLNetVConnection.cc:632 (net_read_io)> (ssl) ssl handshake EVENT_DONE ntodo=0
   [Oct 12 02:11:32.972] [ET_NET 1] DEBUG: <Http1ClientSession.cc:148 (new_connection)> (ssl_early_data) read_from_early_data = 0
   [Oct 12 02:11:32.972] [ET_NET 1] DEBUG: <Http1ClientSession.cc:193 (new_connection)> (http_cs) [1] session born, netvc 0x7fff70027b20
   [Oct 12 02:11:32.972] [ET_NET 1] DIAG: (netlify_traffic_mesh_resolve) Handling session event event=60013
   [Oct 12 02:11:32.972] [ET_NET 1] DIAG: (netlify_traffic_mesh_resolve) Resolving bundle...                          host=votelify.edge-handlers.dev ssn_start=true
   [Oct 12 02:11:32.972] [ET_NET 1] DIAG: (netlify_resolve_async) Resolving bundle...                          host=votelify.edge-handlers.dev
   [New Thread 0x7fffa467f700 (LWP 247)]
   [New Thread 0x7fffa447e700 (LWP 248)]
   [New Thread 0x7fffa427d700 (LWP 249)]
   [New Thread 0x7fff844a4700 (LWP 250)]
   [New Thread 0x7fff842a3700 (LWP 251)]
   [New Thread 0x7fff7bfff700 (LWP 252)]
   [Oct 12 02:11:32.978] mesh-tokio-work DIAG: (netlify_resolve_async) Bundle resolution done
   [Oct 12 02:11:32.978] mesh-tokio-work DIAG: (netlify_traffic_mesh_resolve) Bundle resolution async completed found=true
   [Oct 12 02:11:32.978] [ET_NET 2] DEBUG: <Http1ClientSession.cc:411 (release)> (http_cs) [1] initiating io for next header
   Fatal: UnixNetVConnection.cc:1449: failed assertion `!"BUG: It must have acquired the NetHandler's lock before doing anything on keep_alive_queue."`
   --Type <RET> for more, q to quit, c to continue without paging--RET
   
   Thread 5 "[ET_NET 2]" received signal SIGABRT, Aborted.
   ```

