Posted to user@thrift.apache.org by "Buster, James" <Ja...@transunion.com.INVALID> on 2021/05/23 12:12:17 UTC

How do I debug a thrift hang?

My server gets permanently hung after seeing this internal exception, from lib/cpp/src/thrift/transport/TBufferTransports.h:

  void consume(uint32_t len) {
    countConsumedMessageBytes(len);
    if (TDB_LIKELY(static_cast<ptrdiff_t>(len) <= rBound_ - rBase_)) {
      rBase_ += len;
    } else {
      throw TTransportException(TTransportException::BAD_ARGS, "consume did not follow a borrow.");
    }
  }

Once this happens, the server becomes unresponsive: new clients connect and then hang until TCP times out.
The thread stuck in epoll_wait acts as if it is ignoring everything after the connection is established. It can take anywhere
from 10 minutes to 23 hours of heavy use before the hang occurs, so it's hard to predict and I have no clear
test case (if I had one, I presumably could make it hang immediately).
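
For illustration only, the bounds check in the quoted consume() can be reproduced in isolation. The sketch below is not the Thrift class: ToyReadBuffer, available() and consumeThrows() are hypothetical names, std::invalid_argument stands in for TTransportException, and the countConsumedMessageBytes() call is left out. It only shows the condition under which the message appears, namely a caller consuming more bytes than are currently buffered between rBase_ and rBound_. It says nothing about why the server then hangs.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a buffered read transport; NOT the Thrift class.
class ToyReadBuffer {
public:
  explicit ToyReadBuffer(std::vector<uint8_t> data)
      : data_(std::move(data)),
        rBase_(data_.data()),
        rBound_(data_.data() + data_.size()) {}

  // The raw pointers refer to data_, so copying would leave them dangling.
  ToyReadBuffer(const ToyReadBuffer&) = delete;
  ToyReadBuffer& operator=(const ToyReadBuffer&) = delete;

  // Peek at the next len bytes without advancing; nullptr if fewer are buffered.
  const uint8_t* borrow(uint32_t len) const {
    return static_cast<std::ptrdiff_t>(len) <= rBound_ - rBase_ ? rBase_ : nullptr;
  }

  // Same comparison as the quoted consume(): advance only if len bytes are buffered.
  void consume(uint32_t len) {
    if (static_cast<std::ptrdiff_t>(len) <= rBound_ - rBase_) {
      rBase_ += len;
    } else {
      // Assumption: std::invalid_argument replaces TTransportException here.
      throw std::invalid_argument("consume did not follow a borrow.");
    }
  }

  // Number of bytes still buffered.
  std::ptrdiff_t available() const { return rBound_ - rBase_; }

private:
  std::vector<uint8_t> data_;
  const uint8_t* rBase_;   // next unread byte
  const uint8_t* rBound_;  // one past the last buffered byte
};

// Hypothetical helper: returns true if consume(len) throws, false otherwise.
bool consumeThrows(ToyReadBuffer& buf, uint32_t len) {
  try {
    buf.consume(len);
    return false;
  } catch (const std::invalid_argument&) {
    return true;
  }
}
```

With a four-byte ToyReadBuffer, consuming three bytes succeeds and leaves one byte available. A following consumeThrows(buf, 2) returns true, and the read position is unchanged because the throw happens before rBase_ is advanced.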

RE: How do I debug a thrift hang?

Posted by "Buster, James" <Ja...@transunion.com.INVALID>.
I'm not sure how DNS is involved here; this exception looks like a Thrift internal error. Unfortunately, I don't know enough
about Thrift internals to know how this error can occur, and it appears Thrift doesn't properly recover from the exception,
which causes the hang I'm seeing. This looks like an exception that is expected never to happen.

-----Original Message-----
I'm not sure it is the same problem, but the last time I had a hang in the TTransport part it was due to a DNS misconfiguration that led to big delays in all functions based on the DNS resolver.

Re: How do I debug a thrift hang?

Posted by Paolo Elefante <pa...@gmail.com>.
I'm not sure it is the same problem, but the last time I had a hang in the
TTransport part it was due to a DNS misconfiguration that led to big
delays in all functions based on the DNS resolver.
