You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@trafficserver.apache.org by "Sudheer Vinukonda (JIRA)" <ji...@apache.org> on 2014/12/08 19:31:12 UTC

[jira] [Comment Edited] (TS-3226) SSL data not read from the socket sometimes causing transactions to timeout

    [ https://issues.apache.org/jira/browse/TS-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238206#comment-14238206 ] 

Sudheer Vinukonda edited comment on TS-3226 at 12/8/14 6:30 PM:
----------------------------------------------------------------

I am still a little bit as to unsure how the values of {{nbytes}} and {{ndone}} could get restored inside a mutex to the correct values. For example, in the specific scenario I am debugging, the {{net_read_io}} could be called by the socket input on one side and tunnel read on the other side. The tunnel read sets the {{nbytes}} and {{ndone}} in the {{do_io_read}} which itself does not lock. 

Below are the logs from the little instrumentation that I made, showing that the ndone/nbytes values before the mutex were all "0", while after the mutex were the correct/expected values.


{code}
 //ntodo_t, s_ntodo_t, nbytes_t, ndone_t are calculated before acquiring the mutex..

  if (ntodo_t != (s->vio.nbytes - s->vio.ndone)) {
    TraceIn(true, get_remote_addr(), get_remote_port(),
          "prelock vio %p, afterlock vio %p, before lock, ntodo_t:s_ntodo_t:nbytes_t:ndone_t after lock, ntodo:s_ntodo:nbytes:ndone before checking..%" PRId64 ":%" PRId64 ":%" PRId64 ":%" PRId64 ":%" PRId64 ":%" PRId64 ":%" PRId64 ":%" PRId64, pre_lock_vio, &(s->vio), ntodo_t, s_ntodo_t, nbytes_t, ndone_t, ntodo, s->vio.ntodo(), s->vio.nbytes, s->vio.ndone);
  }
{code}

{code}

-bash-4.1$ tail -f error.log | grep -i prelock
20141208.18h21m39s prelock vio 0x2ae86c0cd760, afterlock vio 0x2ae86c0cd760, before lock, ntodo_t:s_ntodo_t:nbytes_t:ndone_t after lock, ntodo:s_ntodo:nbytes:ndone before checking..0:0:0:0:2054:2054:2054:0
20141208.18h22m56s prelock vio 0x2ae85c002ad0, afterlock vio 0x2ae85c002ad0, before lock, ntodo_t:s_ntodo_t:nbytes_t:ndone_t after lock, ntodo:s_ntodo:nbytes:ndone before checking..0:0:0:0:2693:2693:2693:0
20141208.18h23m10s prelock vio 0x2ae86c52b1d0, afterlock vio 0x2ae86c52b1d0, before lock, ntodo_t:s_ntodo_t:nbytes_t:ndone_t after lock, ntodo:s_ntodo:nbytes:ndone before checking..0:0:0:0:9223372036854775807:9223372036854775807:9223372036854775807:0
20141208.18h23m28s prelock vio 0x2ae86c619390, afterlock vio 0x2ae86c619390, before lock, ntodo_t:s_ntodo_t:nbytes_t:ndone_t after lock, ntodo:s_ntodo:nbytes:ndone before checking..0:65:65:0:65:65:65:0
{code}


was (Author: sudheerv):
I am still a little bit unsure how the values of {{nbytes}} and {{ndone}} could get restored inside a mutex to the correct values. For example, in the specific scenario I am debugging, the {{net_read_io}} could be called by the socket input on one side and tunnel read on the other side. The tunnel read sets the {{nbytes}} and {{ndone}} in the {{do_io_read}} which itself does not lock. 

Below are the logs from the little instrumentation that I made, showing that the ndone/nbytes values before the mutex were all "0", while after the mutex were the correct/expected values.


{code}
 //ntodo_t, s_ntodo_t, nbytes_t, ndone_t are calculated before acquiring the mutex..

  if (ntodo_t != (s->vio.nbytes - s->vio.ndone)) {
    TraceIn(true, get_remote_addr(), get_remote_port(),
          "prelock vio %p, afterlock vio %p, before lock, ntodo_t:s_ntodo_t:nbytes_t:ndone_t after lock, ntodo:s_ntodo:nbytes:ndone before checking..%" PRId64 ":%" PRId64 ":%" PRId64 ":%" PRId64 ":%" PRId64 ":%" PRId64 ":%" PRId64 ":%" PRId64, pre_lock_vio, &(s->vio), ntodo_t, s_ntodo_t, nbytes_t, ndone_t, ntodo, s->vio.ntodo(), s->vio.nbytes, s->vio.ndone);
  }
{code}

{code}

-bash-4.1$ tail -f error.log | grep -i prelock
20141208.18h21m39s prelock vio 0x2ae86c0cd760, afterlock vio 0x2ae86c0cd760, before lock, ntodo_t:s_ntodo_t:nbytes_t:ndone_t after lock, ntodo:s_ntodo:nbytes:ndone before checking..0:0:0:0:2054:2054:2054:0
20141208.18h22m56s prelock vio 0x2ae85c002ad0, afterlock vio 0x2ae85c002ad0, before lock, ntodo_t:s_ntodo_t:nbytes_t:ndone_t after lock, ntodo:s_ntodo:nbytes:ndone before checking..0:0:0:0:2693:2693:2693:0
20141208.18h23m10s prelock vio 0x2ae86c52b1d0, afterlock vio 0x2ae86c52b1d0, before lock, ntodo_t:s_ntodo_t:nbytes_t:ndone_t after lock, ntodo:s_ntodo:nbytes:ndone before checking..0:0:0:0:9223372036854775807:9223372036854775807:9223372036854775807:0
20141208.18h23m28s prelock vio 0x2ae86c619390, afterlock vio 0x2ae86c619390, before lock, ntodo_t:s_ntodo_t:nbytes_t:ndone_t after lock, ntodo:s_ntodo:nbytes:ndone before checking..0:65:65:0:65:65:65:0
{code}

> SSL data not read from the socket sometimes causing transactions to timeout
> ---------------------------------------------------------------------------
>
>                 Key: TS-3226
>                 URL: https://issues.apache.org/jira/browse/TS-3226
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: SSL
>    Affects Versions: 5.1.1
>            Reporter: Sudheer Vinukonda
>            Assignee: Sudheer Vinukonda
>             Fix For: 5.2.0
>
>
> We have had a really long standing problem where some of our origins were complaining of receiving POST requests with non-zero content-length header, but, no body (or sometimes, partial body). Due to the way our network was setup, this problem was not easy to be isolated due to the multiple hops along the way. The post body could be lost anywhere along the path (e.g. client, dns, routers/vips, edge, data center etc). After a lot of debugging and with the help of some custom-built wire traces for SSL, we managed to isolate the problem to our ATS hosts running on our edge layer. From the wire traces, we could see that, the post body is coming in alright, but is just sitting in the socket and not being read by the post ua tunnel producer.
> After further investigation, it seems that the producer is issuing the correct do_io_read for the required number of bytes, but, there seems to be a bug in the {{SSLNetVConnection::net_read_io}}, where the ntodo is being calculated before acquiring the mutex on the read vio.
> https://github.com/apache/trafficserver/blob/master/iocore/net/SSLNetVConnection.cc#L391
> Instrumenting the code with further debug traces showed that, in the failed transactions, I am noticing the ntodo being "0" when determined before the mutex, whereas the (s->vio.nbytes - s->vio.ndone) is non-zero after the mutex. I am not sure to understand how the nbytes on the read vio object can be different before acquiring mutex, but, moving the ntodo calculation after mutex seems to have resolved the problem. Note that this is how it is done in the corresponding function {{read_from_net}} in {{UnixNetVConnection}}.
> Talking to [~amc] on the IRC, it seems that the mutex is needed coz, the {{SSLNetVConnection::net_read_io}} could also be triggered by an incoming socket data before the {{UnixNetVConnection::do_io_read}} could trigger it and that could mess up the nbytes/ndone in the read vio.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)