You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@trafficserver.apache.org by "Craig Schomburg (craigs)" <cr...@cisco.com> on 2016/06/07 02:16:48 UTC

Possible problem isolation!!! SSLNetVConnection IOBufferBlock management issue on ATS 6.0.0

Susan,

I believe we may have found the root of our problem.  Would like your input
to confirm.  Appears that this is a area you are intimately familiar with ☺

So again this is a continuation of the progress update I sent on 05-June
(below)…  So again, single SSL session with multiple successive GETs, stuck
when the initial session buffer is used up…

//
//  Notice that mr (max_read_avail) is increasing with each received packet
//  and eventually maxes out at the buffer size (4096) at which point mw
//  (cur_write_avail) is also ZERO...  No new buffer is allocated due to
//  max_read_avail stuck at 4096...
//
//  Also noticed 2 readers allocated and only one of the two reflects the
//  fact that all data has been processed (read_avail = 0).  Other just
//  keeps growing...
//
//    (net_read_io)> (ssl) craigs net_read_io() mr for readers[0].read_avail=2055
//    (net_read_io)> (ssl) craigs net_read_io() mr for readers[1].read_avail=0
//
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:590 (net_read_io)> (ssl) craigs net_read_io() BEFORE buffer check b=0x55a3d9703cd0, b->next=(nil), b->write_avail()=2041, ntodo=9223372036854775807, wm=0, mr=2055, mw=2041
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:595 (net_read_io)> (ssl) craigs net_read_io() mr for mbuf=0x55a3d9736b80 _writer.read_avail=2055
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:600 (net_read_io)> (ssl) craigs net_read_io() mr for readers[0].read_avail=2055
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:600 (net_read_io)> (ssl) craigs net_read_io() mr for readers[1].read_avail=0
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:207 (ssl_read_from_net)> (ssl) trace=FALSE
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:213 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] b=0x55a3d9703cd0, bnxt=(nil), b->write_avail()=2041
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:220 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] nread=2041
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:231 (ssl_read_from_net)> (ssl) craigs ssl_read_from_net sslErr=0
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:252 (ssl_read_from_net)> (ssl) craigs ssl_read_from_net SSL_ERROR_NONE block_write_avail=0
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:307 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] bytes_read=2041
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:320 (ssl_read_from_net)> (ssl) craigs ssl_read_from_net SSL_READ_READY bytes_read=2041, ntodo=9223372036854773766
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HttpClientSession.cc:412 (state_keep_alive)> (http_cs) [4] [&HttpClientSession::state_keep_alive, VC_EVENT_READ_READY]
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HttpClientSession.cc:139 (new_transaction)> (http_cs) [4] Starting transaction 2 for vc=0x55a3d97534c0 using sm [5]
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HttpSM.cc:2530 (main_handler)> (http) [5] [HttpSM::main_handler, VC_EVENT_READ_READY]
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HttpSM.cc:574 (state_read_client_request_header)> (http) [5] [&HttpSM::state_read_client_request_header, VC_EVENT_READ_READY]
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=1
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=1
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=1
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=1
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=1
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=1
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=1
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=2
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HdrTSOnly.cc:86 (parse_req)> (http) craigs parse_req() PARSE result=2
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HttpSM.cc:626 (state_read_client_request_header)> (http) craigs state_read_client_request_header POST parse_req() event=100, state=2, bytes_used=2041
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HttpSM.cc:710 (state_read_client_request_header)> (http) craigs state_read_client_request_header PARSE_CONT last else leg
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HttpSM.cc:720 (state_read_client_request_header)> (http) craigs state_read_client_request_header PARSE_CONT/VC_EVENT_CONT called reenable()
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <UnixNet.cc:516 (mainNetEvent)> (ssl) craigs RRL post dequeue net_read_io() [1] called from mainNetEvent(), vc=0x55a3d97534c0
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:590 (net_read_io)> (ssl) craigs net_read_io() BEFORE buffer check b=0x55a3d9703cd0, b->next=(nil), b->write_avail()=0, ntodo=9223372036854775807, wm=0, mr=4096, mw=0
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:595 (net_read_io)> (ssl) craigs net_read_io() mr for mbuf=0x55a3d9736b80 _writer.read_avail=4096
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:600 (net_read_io)> (ssl) craigs net_read_io() mr for readers[0].read_avail=4096
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:600 (net_read_io)> (ssl) craigs net_read_io() mr for readers[1].read_avail=0
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:606 (net_read_io)> (ssl) craigs net_read_io() read_disable b=0x55a3d9703cd0, b->next=(nil), b->write_avail()=0, ntodo=9223372036854775807


//
//  Looked at HttpClientSession::new_connection() and found that ATS allowed
//  2 readers to be defined for the session.  In this case the input MIOBuffer
//  and IOBufferReader were NULL.  However, ssl_vc->get_ssl_reader() returned
//  an allocated reader and then ATS turned right around and allocated and
//  overwrote the sm_reader with another new reader because the input reader
//  was NULL... walla 2 readers...
//

void
HttpClientSession::new_connection(NetVConnection *new_vc, MIOBuffer *iobuf, IOBufferReader *reader, bool backdoor)
{
  <snip...>

  if (!iobuf) {
    SSLNetVConnection *ssl_vc = dynamic_cast<SSLNetVConnection *>(new_vc);
    if (ssl_vc) {
      iobuf = ssl_vc->get_ssl_iobuf();
      sm_reader = ssl_vc->get_ssl_reader();
      DebugHttpSsn("craigs new_connection !iobuf ssl_vc alloc MIOBuffer=%p existing IOBufferReader=%p", iobuf, sm_reader);
    }
  }

  DebugHttpSsn("craigs new_connection input MIOBuffer=%p IOBufferReader=%p", iobuf, reader);
  read_buffer = iobuf ? iobuf : new_MIOBuffer(HTTP_HEADER_BUFFER_SIZE_INDEX);
  sm_reader = reader ? reader : read_buffer->alloc_reader();
  DebugHttpSsn("craigs new_connection final MIOBuffer=%p IOBufferReader=%p", read_buffer, sm_reader);


2016-06-07T04:51:16.470+08:00 ts[11903]: {0x2b3dd11d4700} DEBUG: <HttpSessionAccept.cc:60 (accept)> (http_seq) [HttpSessionAccept:mainEvent 0x55a3d9752de0] accepted connection from 10.150.34.84:55476 transport type = 4
2016-06-07T04:51:16.470+08:00 ts[11903]: {0x2b3dd11d4700} DEBUG: <HttpClientSession.cc:201 (new_connection)> (http_cs) [6] session born, netvc 0x55a3d9752de0
2016-06-07T04:51:16.470+08:00 ts[11903]: {0x2b3dd11d4700} DEBUG: <HttpClientSession.cc:208 (new_connection)> (http_cs) craigs new_connection !iobuf ssl_vc alloc MIOBuffer=0x55a3d97366d0 existing IOBufferReader=0x55a3d97366e8
2016-06-07T04:51:16.470+08:00 ts[11903]: {0x2b3dd11d4700} DEBUG: <HttpClientSession.cc:212 (new_connection)> (http_cs) craigs new_connection input MIOBuffer=0x55a3d97366d0 IOBufferReader=(nil)
2016-06-07T04:51:16.470+08:00 ts[11903]: {0x2b3dd11d4700} DEBUG: <P_IOBuffer.h:832 (alloc_reader)> (ssl) craigs MIOBuffer::alloc_reader() reader-idx=1, MIOBuffer=0x55a3d97366d0, IOBufferReader=0x55a3d9736710
2016-06-07T04:51:16.470+08:00 ts[11903]: {0x2b3dd11d4700} DEBUG: <HttpClientSession.cc:215 (new_connection)> (http_cs) craigs new_connection final MIOBuffer=0x55a3d97366d0 IOBufferReader=0x55a3d9736710
2016-06-07T04:51:16.470+08:00 ts[11903]: {0x2b3dd11d4700} DIAG: (drop_failed_auth) Session Start

//
//  Made the following modification to use the input reader if defined, else
//  the ssl_vc->get_ssl_reader() and lastly allocate a new reader only if
//  neither of the previous readers was defined.  Gave this a try and our
//  issue no longer occurs.  Reads from multiple GETs on the same session
//  are processed, buffer wraps are handled, and all "appears" to be happy.
//

  DebugHttpSsn("craigs new_connection input MIOBuffer=%p IOBufferReader=%p", iobuf, reader);
  if (!iobuf) {
    SSLNetVConnection *ssl_vc = dynamic_cast<SSLNetVConnection *>(new_vc);
    if (ssl_vc) {
      iobuf = ssl_vc->get_ssl_iobuf();
      reader = reader ? reader : ssl_vc->get_ssl_reader();   // Modified line
      DebugHttpSsn("craigs new_connection !iobuf ssl_vc alloc MIOBuffer=%p input or existing IOBufferReader=%p", iobuf, reader);
    }
  }

  read_buffer = iobuf ? iobuf : new_MIOBuffer(HTTP_HEADER_BUFFER_SIZE_INDEX);
  sm_reader = reader ? reader : read_buffer->alloc_reader();
  DebugHttpSsn("craigs new_connection final MIOBuffer=%p IOBufferReader=%p", read_buffer, sm_reader);


Now, with the above working, I looked at the latest code in GIT Hub and now
see that the above functionality was rewritten via:

   TS-3612: Restructure Proxy Client Sessions to support transaction oriented
            Sessions execute transaction hooks and connection oriented Sessions
            execute session hooks

   https://issues.apache.org/jira/browse/TS-3612.

Susan, I see you were the author of this rewrite.  Also looking at the code this
appears to be a rather substantial rewrite that went into 6.2.0.  Also appears
that there may be a number of dependencies on other TS-cases?

So my question is might the change I implemented above an acceptable option
to address this issue as an interim solution until we can upgrade to 6.2.0 or
might there be other issues if we use my simple fix approach above?  Any other
options for a simple fix short of pulling in TS-3612?

If TS-3612 is needed is this a reasonable patchable effort for ATS 6.2.0?  Are
there other TS-cases that I would be needed in conjunction with TS-3612?

Thanks again for all of the insight from Susan, Alan, and Chao thus far.

Craig Schomburg



From: Craig Schomburg <cr...@cisco.com>
Date: Sunday, June 5, 2016 at 10:14 PM
To: Susan Hinrichs <sh...@network-geographics.com>, Alan Carroll <so...@yahoo-inc.com>, Chao Xu <xu...@gmail.com>
Cc: Craig Schomburg <cr...@cisco.com>
Subject: UPDATE: SSLNetVConnection IOBufferBlock management issue on ATS 6.0.0


Susan, Alan, and Chao,



Did not send this on the ATS alias due to attached debug log segment.



Spent the weekend continuing the debug efforts but this has been a hard issue to chase down as traffic server is a new area for me.  Wanted to provide a bit more information to see if this rings a bell or to get some debug guidance.  Hoping that this information makes a bit more sense and hoping you folks have time to provide guidance based on your expertise with this code base.  We are on ATS 6.0.0 at this time.



The original information below from last Friday’s email is still true with the exception of the statement below that stated that we never saw net_read_io() resume on that session.



Here is an updated description as we have progressed in our investigation this weekend…



·         As stated below, we have a remote Client that generates a total of 4 SSL sessions at about the same time.

·         Data was only sent on 2 of the 4 sessions.  One of the two sessions terminated after sending a single HTTP/1.1 GET request so no issues were seen on that session.  The other two sessions were simply opened and then closed (not sure what their use was).

·         On the session that we hit the problem on, it issues multiple GET requests on the same sessions waiting for a ACK on the previous GET before sending the next.



So here is the updated information…



·         As each GET is received, we see the call made in net_read_io() to check the buffer status before proceeding with the next read…

  if (ntodo <= 0 || !buf.writer()->write_avail()) {

    read_disable(nh, this);

    return;

  }

With the preceding GET request calls, since there was still buffer space available, no new buffer was allocated as a result of this call.  That being said, we added debug code to dump the max_read_avail, cur_write_avail, and watermark values just prior to this call.  What we see is that the max_read_avail increases with each packet based on the received packet size and likewise cur_write_avail decreases.  Expected to see cur_write_avail decrease but at the moment we can not explain while max_read_avail was not “consumed” as each preceding packet was received, processed, and ack’ed (200 OK).



·         When the GET request was finally received that exceeded cur_write_avail, the amount remaining in cur_write_avail was read and HTTP/MIME parse returned PARSE_CONT.  Previously I stated that net_io_read() was not called again and that no further processing occurred.  That was in fact NOT CORRECT.  What happened was net_io_read() was in fact called as was !buf.writer->write_avail().  In this case since max_read_avail was 4096, cur_write_avail was 0, and water_mark was 0 add_block was not called and no buffer allocated.  As a result read_disable was also called.  After the call to read_disable() no further activity was seen in the traffic_server trace on this session.



Short snippet of key data from the included Debug.log. Complete trace of this sequence is in the Debug.log



In the log below…

  wm =  water_mark

  mr =  max_read_avail

  mw = cur_write_avail



//

//  1st GET request

//

2016-06-06T07:59:39.638+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <UnixNet.cc:516 (mainNetEvent)> (ssl) craigs RRL post dequeue net_read_io() [1] called from mainNetEvent(), vc=0x564698572e80

2016-06-06T07:59:39.638+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:590 (net_read_io)> (ssl) craigs net_read_io() BEFOREbuffer check b=0x564698523cd0, b->next=(nil), b->write_avail()=4096, ntodo=9223372036854775807, wm=0, mr=0, mw=4096

2016-06-06T07:59:39.638+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:213 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] b=0x564698523cd0, bnxt=(nil), b->write_avail()=4096

2016-06-06T07:59:39.638+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:220 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] nread=2008

2016-06-06T07:59:39.638+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:252 (ssl_read_from_net)> (ssl) craigs ssl_read_from_net SSL_ERROR_NONE block_write_avail=2088



2016-06-06T07:59:39.653+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:590 (net_read_io)> (ssl) craigs net_read_io() BEFOREbuffer check b=0x564698523cd0, b->next=(nil), b->write_avail()=2088, ntodo=0, wm=0, mr=2008, mw=2088

2016-06-06T07:59:39.653+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[0].read_avail=2008

2016-06-06T07:59:39.653+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[1].read_avail=0

2016-06-06T07:59:39.653+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:601 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - _writer.read_avail=2008

2016-06-06T07:59:39.653+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:606 (net_read_io)> (ssl) craigs net_read_io() read_disable b=0x564698523cd0, b->next=(nil), b->write_avail()=2088, ntodo=0



//

//  Above packet processed and 200 OK sent...

//

2016-06-06T07:59:39.736+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <UnixNet.cc:516 (mainNetEvent)> (ssl) craigs RRL post dequeue net_read_io() [1] called from mainNetEvent(), vc=0x564698572e80

2016-06-06T07:59:39.736+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:590 (net_read_io)> (ssl) craigs net_read_io() BEFOREbuffer check b=0x564698523cd0, b->next=(nil), b->write_avail()=2088, ntodo=9223372036854775807, wm=0, mr=2008, mw=2088

2016-06-06T07:59:39.736+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[0].read_avail=2008

2016-06-06T07:59:39.736+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[1].read_avail=0

2016-06-06T07:59:39.736+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:601 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - _writer.read_avail=2008

2016-06-06T07:59:39.736+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:213 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] b=0x564698523cd0, bnxt=(nil), b->write_avail()=2088

2016-06-06T07:59:39.736+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:220 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] nread=0



//

//  2nd GET request

//

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <UnixNet.cc:516 (mainNetEvent)> (ssl) craigs RRL post dequeue net_read_io() [1] called from mainNetEvent(), vc=0x564698572e80

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:590 (net_read_io)> (ssl) craigs net_read_io() BEFOREbuffer check b=0x564698523cd0, b->next=(nil), b->write_avail()=2088, ntodo=9223372036854775807, wm=0, mr=2008, mw=2088

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[0].read_avail=2008

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[1].read_avail=0

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:601 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - _writer.read_avail=2008

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:213 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] b=0x564698523cd0, bnxt=(nil), b->write_avail()=2088

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:220 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] nread=2088

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:231 (ssl_read_from_net)> (ssl) craigs ssl_read_from_net sslErr=0

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:252 (ssl_read_from_net)> (ssl) craigs ssl_read_from_net SSL_ERROR_NONE block_write_avail=0

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:307 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] bytes_read=2088

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:320 (ssl_read_from_net)> (ssl) craigs ssl_read_from_net SSL_READ_READY bytes_read=2088, ntodo=9223372036854773719



2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=2

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <HdrTSOnly.cc:86 (parse_req)> (http) craigs parse_req()  PARSE result=2 (PARSE_CONT)

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <HttpSM.cc:626 (state_read_client_request_header)> (http) craigs state_read_client_request_header POST parse_req() event=100, state=2

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <HttpSM.cc:710 (state_read_client_request_header)> (http) craigs state_read_client_request_header PARSE_CONT last else leg

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <HttpSM.cc:720 (state_read_client_request_header)> (http) craigs state_read_client_request_header PARSE_CONT/VC_EVENT_CONT called reenable()



2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <UnixNet.cc:516 (mainNetEvent)> (ssl) craigs RRL post dequeue net_read_io() [1] called from mainNetEvent(), vc=0x564698572e80

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:590 (net_read_io)> (ssl) craigs net_read_io() BEFOREbuffer check b=0x564698523cd0, b->next=(nil), b->write_avail()=0, ntodo=9223372036854775807, wm=0, mr=4096, mw=0

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[0].read_avail=4096

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[1].read_avail=0

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:601 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - _writer.read_avail=4096

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:606 (net_read_io)> (ssl) craigs net_read_io() read_disable b=0x564698523cd0, b->next=(nil), b->write_avail()=0, ntodo=9223372036854775807

//

//  Above is end of failed GET request transaction 2 for vc=0x564698572e80 using sm [6].

//  No further activity was seein in the trace on this session.

//



So looking at the code what we are having problems identifying is how/where consume() gets called on the IOBufferBlock to update _start as each preceding packet and partial packet was processed?



One additional data point is I tried to recreate this same issue with an external test harness creating the session traffic but in that case we did not see this issue.  I was able to send multiple GET requests serially on a single session and see the buffer depletion/recovery and no issues were seen.  In addition I did not see the continuously accumulating max_read_avail and cur_write_avail was decremented by the packet size as each new packet was received and processed.



So at this point I am thoroughly baffled by the differing behaviors between the real recreate and the simulated recreate.  What I am trying to understand is what might be different between these two sessions to cause the difference in behavior processing the HTTP/1.1 GET requests.



Again, any debug pointers would be greatly appreciated as our progress in narrowing in on the root cause has been much to slow.



Thanks,



Craig Schomburg





On 6/3/16, 3:43 PM, "Craig Schomburg (craigs)" <cr...@cisco.com> wrote:



Leif and Susan,



Based on your knowledge of the releases 6.0.0 and beyond, if we were looking at the best blessed release so far would you recommend looking at 6.1.1 (last posted release on the web site) or is there a 6.2 that is blessed as stable and deployed that would be a better option?



Also we are just about to FCS a new internal product release with 6.0.0.  Are there any other known deployments of 6.0.0 that have indicated they are stable?



We are still digging into the SSLNetVConnection IOBufferBlock management issue mentioned on our previous threads but are considering looking at a later release in parallel.



As an update for Susan and Alan, on our SSLNetVConnection IOBufferBlock issue we are progressing further but still digging in on the root cause (slow digging).  Turns out the previous SSL_ERROR_ZERO_RETURN data point we had mentioned was on a different session so was unrelated to this issue.



What we have determined thus far is

·         we have a Client that is generating ~3 SSL sessions in parallel (same Src IP, Different Src-Port).

·         On the session in error, it issues 3 GET requests serially (each is ~1.6K), each packet associated with the GET is in 2 fragments.

·         As each GET HTTP/1.1. is received we work our way through the first IOBufferBlock and on the 3rd we have insufficient buffer space for the complete packet (buffer_write_avail = 0, b->next = NULL).

·         The first part of the 3rd GET is sent off to HTTP for processing and we get a PARSE_CONT/ SSL_ERROR_NONE.  Problem we see is that net_read_io() is never re-invoked to resume the remainder of the read.



In our live network environment we can recreate this at will.  One unique item in the live client sequence is just before we get the 3rd GET on the initial SSL session, one of the other three SSL sessions on this client gets terminated.  This was the source of the SSL_ERROR_ZERO_RETURN message we reported on the earlier thread.  So we see the SSL_ERROR_ZERO_RETURN on the unrelated session followed by the incomplete GET processing on the other session on the same client.



If we use our test harness and do the same, less the extra session termination (Start SSL session, send 3 GETs that cause multiple read passes on the 3rd GET) we never see the failure.



Just curious if any of this sounds like any other issues folks my be familiar with in ATS 6.0.0?



Craig S.



Possible problem isolation!!! SSLNetVConnection IOBufferBlock management issue on ATS 6.0.0

Posted by "Craig Schomburg (craigs)" <cr...@cisco.com>.
Susan, or anyone else with knowledge in this area…

If you get a chance to look this over please send me your thoughts/recommendations.  We are preparing for a product software release and needed to reach closure on this issue before we can release.  We understand everyone is busy so no flames are intended by this.  As usual we appreciate all of the input that we have received thus far on this issue.

Thanks,

Craig Schomburg

From: Craig Schomburg <cr...@cisco.com>
Date: Monday, June 6, 2016 at 10:16 PM
To: Susan Hinrichs <sh...@network-geographics.com>, Alan Carroll <so...@yahoo-inc.com>, Chao Xu <xu...@gmail.com>
Cc: "dev@trafficserver.apache.org" <de...@trafficserver.apache.org>, Craig Schomburg <cr...@cisco.com>
Subject: Possible problem isolation!!! SSLNetVConnection IOBufferBlock management issue on ATS 6.0.0

Susan,

I believe we may have found the root of our problem.  Would like your input
to confirm.  Appears that this is a area you are intimately familiar with ☺

So again this is a continuation of the progress update I sent on 05-June
(below)…  So again, single SSL session with multiple successive GETs, stuck
when the initial session buffer is used up…

//
//  Notice that mr (max_read_avail) is increasing with each received packet
//  and eventually maxes out at the buffer size (4096) at which point mw
//  (cur_write_avail) is also ZERO...  No new buffer is allocated due to
//  max_read_avail stuck at 4096...
//
//  Also noticed 2 readers allocated and only one of the two reflects the
//  fact that all data has been processed (read_avail = 0).  Other just
//  keeps growing...
//
//    (net_read_io)> (ssl) craigs net_read_io() mr for readers[0].read_avail=2055
//    (net_read_io)> (ssl) craigs net_read_io() mr for readers[1].read_avail=0
//
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:590 (net_read_io)> (ssl) craigs net_read_io() BEFORE buffer check b=0x55a3d9703cd0, b->next=(nil), b->write_avail()=2041, ntodo=9223372036854775807, wm=0, mr=2055, mw=2041
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:595 (net_read_io)> (ssl) craigs net_read_io() mr for mbuf=0x55a3d9736b80 _writer.read_avail=2055
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:600 (net_read_io)> (ssl) craigs net_read_io() mr for readers[0].read_avail=2055
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:600 (net_read_io)> (ssl) craigs net_read_io() mr for readers[1].read_avail=0
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:207 (ssl_read_from_net)> (ssl) trace=FALSE
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:213 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] b=0x55a3d9703cd0, bnxt=(nil), b->write_avail()=2041
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:220 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] nread=2041
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:231 (ssl_read_from_net)> (ssl) craigs ssl_read_from_net sslErr=0
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:252 (ssl_read_from_net)> (ssl) craigs ssl_read_from_net SSL_ERROR_NONE block_write_avail=0
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:307 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] bytes_read=2041
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:320 (ssl_read_from_net)> (ssl) craigs ssl_read_from_net SSL_READ_READY bytes_read=2041, ntodo=9223372036854773766
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HttpClientSession.cc:412 (state_keep_alive)> (http_cs) [4] [&HttpClientSession::state_keep_alive, VC_EVENT_READ_READY]
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HttpClientSession.cc:139 (new_transaction)> (http_cs) [4] Starting transaction 2 for vc=0x55a3d97534c0 using sm [5]
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HttpSM.cc:2530 (main_handler)> (http) [5] [HttpSM::main_handler, VC_EVENT_READ_READY]
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HttpSM.cc:574 (state_read_client_request_header)> (http) [5] [&HttpSM::state_read_client_request_header, VC_EVENT_READ_READY]
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=1
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=1
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=1
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=1
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=1
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=1
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=1
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=2
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HdrTSOnly.cc:86 (parse_req)> (http) craigs parse_req() PARSE result=2
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HttpSM.cc:626 (state_read_client_request_header)> (http) craigs state_read_client_request_header POST parse_req() event=100, state=2, bytes_used=2041
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HttpSM.cc:710 (state_read_client_request_header)> (http) craigs state_read_client_request_header PARSE_CONT last else leg
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <HttpSM.cc:720 (state_read_client_request_header)> (http) craigs state_read_client_request_header PARSE_CONT/VC_EVENT_CONT called reenable()
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <UnixNet.cc:516 (mainNetEvent)> (ssl) craigs RRL post dequeue net_read_io() [1] called from mainNetEvent(), vc=0x55a3d97534c0
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:590 (net_read_io)> (ssl) craigs net_read_io() BEFORE buffer check b=0x55a3d9703cd0, b->next=(nil), b->write_avail()=0, ntodo=9223372036854775807, wm=0, mr=4096, mw=0
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:595 (net_read_io)> (ssl) craigs net_read_io() mr for mbuf=0x55a3d9736b80 _writer.read_avail=4096
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:600 (net_read_io)> (ssl) craigs net_read_io() mr for readers[0].read_avail=4096
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:600 (net_read_io)> (ssl) craigs net_read_io() mr for readers[1].read_avail=0
2016-06-07T04:51:10.961+08:00 ts[11903]: {0x2b3dcd4fa580} DEBUG: <SSLNetVConnection.cc:606 (net_read_io)> (ssl) craigs net_read_io() read_disable b=0x55a3d9703cd0, b->next=(nil), b->write_avail()=0, ntodo=9223372036854775807


//
//  Looked at HttpClientSession::new_connection() and found that ATS allowed
//  2 readers to be defined for the session.  In this case the input MIOBuffer
//  and IOBufferReader were NULL.  However, ssl_vc->get_ssl_reader() returned
//  an allocated reader and then ATS turned right around and allocated and
//  overwrote the sm_reader with another new reader because the input reader
//  was NULL... walla 2 readers...
//

void
HttpClientSession::new_connection(NetVConnection *new_vc, MIOBuffer *iobuf, IOBufferReader *reader, bool backdoor)
{
  <snip...>

  if (!iobuf) {
    SSLNetVConnection *ssl_vc = dynamic_cast<SSLNetVConnection *>(new_vc);
    if (ssl_vc) {
      iobuf = ssl_vc->get_ssl_iobuf();
      sm_reader = ssl_vc->get_ssl_reader();
      DebugHttpSsn("craigs new_connection !iobuf ssl_vc alloc MIOBuffer=%p existing IOBufferReader=%p", iobuf, sm_reader);
    }
  }

  DebugHttpSsn("craigs new_connection input MIOBuffer=%p IOBufferReader=%p", iobuf, reader);
  read_buffer = iobuf ? iobuf : new_MIOBuffer(HTTP_HEADER_BUFFER_SIZE_INDEX);
  sm_reader = reader ? reader : read_buffer->alloc_reader();
  DebugHttpSsn("craigs new_connection final MIOBuffer=%p IOBufferReader=%p", read_buffer, sm_reader);


2016-06-07T04:51:16.470+08:00 ts[11903]: {0x2b3dd11d4700} DEBUG: <HttpSessionAccept.cc:60 (accept)> (http_seq) [HttpSessionAccept:mainEvent 0x55a3d9752de0] accepted connection from 10.150.34.84:55476 transport type = 4
2016-06-07T04:51:16.470+08:00 ts[11903]: {0x2b3dd11d4700} DEBUG: <HttpClientSession.cc:201 (new_connection)> (http_cs) [6] session born, netvc 0x55a3d9752de0
2016-06-07T04:51:16.470+08:00 ts[11903]: {0x2b3dd11d4700} DEBUG: <HttpClientSession.cc:208 (new_connection)> (http_cs) craigs new_connection !iobuf ssl_vc alloc MIOBuffer=0x55a3d97366d0 existing IOBufferReader=0x55a3d97366e8
2016-06-07T04:51:16.470+08:00 ts[11903]: {0x2b3dd11d4700} DEBUG: <HttpClientSession.cc:212 (new_connection)> (http_cs) craigs new_connection input MIOBuffer=0x55a3d97366d0 IOBufferReader=(nil)
2016-06-07T04:51:16.470+08:00 ts[11903]: {0x2b3dd11d4700} DEBUG: <P_IOBuffer.h:832 (alloc_reader)> (ssl) craigs MIOBuffer::alloc_reader() reader-idx=1, MIOBuffer=0x55a3d97366d0, IOBufferReader=0x55a3d9736710
2016-06-07T04:51:16.470+08:00 ts[11903]: {0x2b3dd11d4700} DEBUG: <HttpClientSession.cc:215 (new_connection)> (http_cs) craigs new_connection final MIOBuffer=0x55a3d97366d0 IOBufferReader=0x55a3d9736710
2016-06-07T04:51:16.470+08:00 ts[11903]: {0x2b3dd11d4700} DIAG: (drop_failed_auth) Session Start

//
//  Made the following modification to use the input reader if defined, else
//  the ssl_vc->get_ssl_reader() and lastly allocate a new reader only if
//  neither of the previous readers was defined.  Gave this a try and our
//  issue no longer occurs.  Reads from multiple GETs on the same session
//  are processed, buffer wraps are handled, and all "appears" to be happy.
//

  DebugHttpSsn("craigs new_connection input MIOBuffer=%p IOBufferReader=%p", iobuf, reader);
  if (!iobuf) {
    SSLNetVConnection *ssl_vc = dynamic_cast<SSLNetVConnection *>(new_vc);
    if (ssl_vc) {
      iobuf = ssl_vc->get_ssl_iobuf();
      reader = reader ? reader : ssl_vc->get_ssl_reader();   // Modified line
      DebugHttpSsn("craigs new_connection !iobuf ssl_vc alloc MIOBuffer=%p input or existing IOBufferReader=%p", iobuf, reader);
    }
  }

  read_buffer = iobuf ? iobuf : new_MIOBuffer(HTTP_HEADER_BUFFER_SIZE_INDEX);
  sm_reader = reader ? reader : read_buffer->alloc_reader();
  DebugHttpSsn("craigs new_connection final MIOBuffer=%p IOBufferReader=%p", read_buffer, sm_reader);


Now, with the above working, I looked at the latest code in GIT Hub and now
see that the above functionality was rewritten via:

   TS-3612: Restructure Proxy Client Sessions to support transaction oriented
            Sessions execute transaction hooks and connection oriented Sessions
            execute session hooks

   https://issues.apache.org/jira/browse/TS-3612.

Susan, I see you were the author of this rewrite.  Also looking at the code this
appears to be a rather substantial rewrite that went into 6.2.0.  Also appears
that there may be a number of dependencies on other TS-cases?

So my question is might the change I implemented above an acceptable option
to address this issue as an interim solution until we can upgrade to 6.2.0 or
might there be other issues if we use my simple fix approach above?  Any other
options for a simple fix short of pulling in TS-3612?

If TS-3612 is needed is this a reasonable patchable effort for ATS 6.2.0?  Are
there other TS-cases that I would be needed in conjunction with TS-3612?

Thanks again for all of the insight from Susan, Alan, and Chao thus far.

Craig Schomburg



From: Craig Schomburg <cr...@cisco.com>
Date: Sunday, June 5, 2016 at 10:14 PM
To: Susan Hinrichs <sh...@network-geographics.com>, Alan Carroll <so...@yahoo-inc.com>, Chao Xu <xu...@gmail.com>
Cc: Craig Schomburg <cr...@cisco.com>
Subject: UPDATE: SSLNetVConnection IOBufferBlock management issue on ATS 6.0.0


Susan, Alan, and Chao,



Did not send this on the ATS alias due to attached debug log segment.



Spent the weekend continuing the debug efforts but this has been a hard issue to chase down as traffic server is a new area for me.  Wanted to provide a bit more information to see if this rings a bell or to get some debug guidance.  Hoping that this information makes a bit more sense and hoping you folks have time to provide guidance based on your expertise with this code base.  We are on ATS 6.0.0 at this time.



The original information below from last Friday’s email is still true with the exception of the statement below that stated that we never saw net_read_io() resume on that session.



Here is an updated description as we have progressed in our investigation this weekend…



·         As stated below, we have a remote Client that generates a total of 4 SSL sessions at about the same time.

·         Data was only sent on 2 of the 4 sessions.  One of the two sessions terminated after sending a single HTTP/1.1 GET request so no issues were seen on that session.  The other two sessions were simply opened and then closed (not sure what their use was).

·         On the session that we hit the problem on, it issues multiple GET requests on the same sessions waiting for a ACK on the previous GET before sending the next.



So here is the updated information…



·         As each GET is received, we see the call made in net_read_io() to check the buffer status before proceeding with the next read…

  if (ntodo <= 0 || !buf.writer()->write_avail()) {

    read_disable(nh, this);

    return;

  }

With the preceding GET request calls, since there was still buffer space available, no new buffer was allocated as a result of this call.  That being said, we added debug code to dump the max_read_avail, cur_write_avail, and watermark values just prior to this call.  What we see is that the max_read_avail increases with each packet based on the received packet size and likewise cur_write_avail decreases.  Expected to see cur_write_avail decrease but at the moment we can not explain while max_read_avail was not “consumed” as each preceding packet was received, processed, and ack’ed (200 OK).



·         When the GET request was finally received that exceeded cur_write_avail, the amount remaining in cur_write_avail was read and HTTP/MIME parse returned PARSE_CONT.  Previously I stated that net_io_read() was not called again and that no further processing occurred.  That was in fact NOT CORRECT.  What happened was net_io_read() was in fact called as was !buf.writer->write_avail().  In this case since max_read_avail was 4096, cur_write_avail was 0, and water_mark was 0 add_block was not called and no buffer allocated.  As a result read_disable was also called.  After the call to read_disable() no further activity was seen in the traffic_server trace on this session.



Short snippet of key data from the included Debug.log. Complete trace of this sequence is in the Debug.log



In the log below…

  wm =  water_mark

  mr =  max_read_avail

  mw = cur_write_avail



//

//  1st GET request

//

2016-06-06T07:59:39.638+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <UnixNet.cc:516 (mainNetEvent)> (ssl) craigs RRL post dequeue net_read_io() [1] called from mainNetEvent(), vc=0x564698572e80

2016-06-06T07:59:39.638+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:590 (net_read_io)> (ssl) craigs net_read_io() BEFOREbuffer check b=0x564698523cd0, b->next=(nil), b->write_avail()=4096, ntodo=9223372036854775807, wm=0, mr=0, mw=4096

2016-06-06T07:59:39.638+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:213 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] b=0x564698523cd0, bnxt=(nil), b->write_avail()=4096

2016-06-06T07:59:39.638+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:220 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] nread=2008

2016-06-06T07:59:39.638+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:252 (ssl_read_from_net)> (ssl) craigs ssl_read_from_net SSL_ERROR_NONE block_write_avail=2088



2016-06-06T07:59:39.653+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:590 (net_read_io)> (ssl) craigs net_read_io() BEFOREbuffer check b=0x564698523cd0, b->next=(nil), b->write_avail()=2088, ntodo=0, wm=0, mr=2008, mw=2088

2016-06-06T07:59:39.653+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[0].read_avail=2008

2016-06-06T07:59:39.653+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[1].read_avail=0

2016-06-06T07:59:39.653+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:601 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - _writer.read_avail=2008

2016-06-06T07:59:39.653+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:606 (net_read_io)> (ssl) craigs net_read_io() read_disable b=0x564698523cd0, b->next=(nil), b->write_avail()=2088, ntodo=0



//

//  Above packet processed and 200 OK sent...

//

2016-06-06T07:59:39.736+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <UnixNet.cc:516 (mainNetEvent)> (ssl) craigs RRL post dequeue net_read_io() [1] called from mainNetEvent(), vc=0x564698572e80

2016-06-06T07:59:39.736+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:590 (net_read_io)> (ssl) craigs net_read_io() BEFOREbuffer check b=0x564698523cd0, b->next=(nil), b->write_avail()=2088, ntodo=9223372036854775807, wm=0, mr=2008, mw=2088

2016-06-06T07:59:39.736+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[0].read_avail=2008

2016-06-06T07:59:39.736+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[1].read_avail=0

2016-06-06T07:59:39.736+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:601 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - _writer.read_avail=2008

2016-06-06T07:59:39.736+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:213 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] b=0x564698523cd0, bnxt=(nil), b->write_avail()=2088

2016-06-06T07:59:39.736+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:220 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] nread=0



//

//  2nd GET request

//

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <UnixNet.cc:516 (mainNetEvent)> (ssl) craigs RRL post dequeue net_read_io() [1] called from mainNetEvent(), vc=0x564698572e80

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:590 (net_read_io)> (ssl) craigs net_read_io() BEFOREbuffer check b=0x564698523cd0, b->next=(nil), b->write_avail()=2088, ntodo=9223372036854775807, wm=0, mr=2008, mw=2088

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[0].read_avail=2008

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[1].read_avail=0

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:601 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - _writer.read_avail=2008

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:213 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] b=0x564698523cd0, bnxt=(nil), b->write_avail()=2088

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:220 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] nread=2088

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:231 (ssl_read_from_net)> (ssl) craigs ssl_read_from_net sslErr=0

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:252 (ssl_read_from_net)> (ssl) craigs ssl_read_from_net SSL_ERROR_NONE block_write_avail=0

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:307 (ssl_read_from_net)> (ssl) [SSL_NetVConnection::ssl_read_from_net] bytes_read=2088

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:320 (ssl_read_from_net)> (ssl) craigs ssl_read_from_net SSL_READ_READY bytes_read=2088, ntodo=9223372036854773719



2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <MIME.cc:2454 (mime_scanner_get)> (http) craigs mime_scanner_get() PARSE result=2

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <HdrTSOnly.cc:86 (parse_req)> (http) craigs parse_req()  PARSE result=2 (PARSE_CONT)

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <HttpSM.cc:626 (state_read_client_request_header)> (http) craigs state_read_client_request_header POST parse_req() event=100, state=2

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <HttpSM.cc:710 (state_read_client_request_header)> (http) craigs state_read_client_request_header PARSE_CONT last else leg

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <HttpSM.cc:720 (state_read_client_request_header)> (http) craigs state_read_client_request_header PARSE_CONT/VC_EVENT_CONT called reenable()



2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <UnixNet.cc:516 (mainNetEvent)> (ssl) craigs RRL post dequeue net_read_io() [1] called from mainNetEvent(), vc=0x564698572e80

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:590 (net_read_io)> (ssl) craigs net_read_io() BEFOREbuffer check b=0x564698523cd0, b->next=(nil), b->write_avail()=0, ntodo=9223372036854775807, wm=0, mr=4096, mw=0

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[0].read_avail=4096

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:596 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - readers[1].read_avail=0

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:601 (net_read_io)> (ssl) craigs net_read_io() mr NON-ZERO - _writer.read_avail=4096

2016-06-06T07:59:40.032+08:00 traffic_server[12837]: {0x2b8699f01700} DEBUG: <SSLNetVConnection.cc:606 (net_read_io)> (ssl) craigs net_read_io() read_disable b=0x564698523cd0, b->next=(nil), b->write_avail()=0, ntodo=9223372036854775807

//

//  Above is end of failed GET request transaction 2 for vc=0x564698572e80 using sm [6].

//  No further activity was seein in the trace on this session.

//



So looking at the code what we are having problems identifying is how/where consume() gets called on the IOBufferBlock to update _start as each preceding packet and partial packet was processed?



One additional data point is I tried to recreate this same issue with an external test harness creating the session traffic but in that case we did not see this issue.  I was able to send multiple GET requests serially on a single session and see the buffer depletion/recovery and no issues were seen.  In addition I did not see the continuously accumulating max_read_avail and cur_write_avail was decremented by the packet size as each new packet was received and processed.



So at this point I am thoroughly baffled by the differing behaviors between the real recreate and the simulated recreate.  What I am trying to understand is what might be different between these two sessions to cause the difference in behavior processing the HTTP/1.1 GET requests.



Again, any debug pointers would be greatly appreciated as our progress in narrowing in on the root cause has been much to slow.



Thanks,



Craig Schomburg





On 6/3/16, 3:43 PM, "Craig Schomburg (craigs)" <cr...@cisco.com> wrote:



Leif and Susan,



Based on your knowledge of the releases 6.0.0 and beyond, if we were looking at the best blessed release so far would you recommend looking at 6.1.1 (last posted release on the web site) or is there a 6.2 that is blessed as stable and deployed that would be a better option?



Also we are just about to FCS a new internal product release with 6.0.0.  Are there any other known deployments of 6.0.0 that have indicated they are stable?



We are still digging into the SSLNetVConnection IOBufferBlock management issue mentioned on our previous threads but are considering looking at a later release in parallel.



As an update for Susan and Alan, on our SSLNetVConnection IOBufferBlock issue we are progressing further but still digging in on the root cause (slow digging).  Turns out the previous SSL_ERROR_ZERO_RETURN data point we had mentioned was on a different session so was unrelated to this issue.



What we have determined thus far is

·         we have a Client that is generating ~3 SSL sessions in parallel (same Src IP, Different Src-Port).

·         On the session in error, it issues 3 GET requests serially (each is ~1.6K), each packet associated with the GET is in 2 fragments.

·         As each GET HTTP/1.1. is received we work our way through the first IOBufferBlock and on the 3rd we have insufficient buffer space for the complete packet (buffer_write_avail = 0, b->next = NULL).

·         The first part of the 3rd GET is sent off to HTTP for processing and we get a PARSE_CONT/ SSL_ERROR_NONE.  Problem we see is that net_read_io() is never re-invoked to resume the remainder of the read.



In our live network environment we can recreate this at will.  One unique item in the live client sequence is just before we get the 3rd GET on the initial SSL session, one of the other three SSL sessions on this client gets terminated.  This was the source of the SSL_ERROR_ZERO_RETURN message we reported on the earlier thread.  So we see the SSL_ERROR_ZERO_RETURN on the unrelated session followed by the incomplete GET processing on the other session on the same client.



If we use our test harness and do the same, less the extra session termination (Start SSL session, send 3 GETs that cause multiple read passes on the 3rd GET) we never see the failure.



Just curious if any of this sounds like any other issues folks my be familiar with in ATS 6.0.0?



Craig S.