Posted to dev@storm.apache.org by Li Wang <wa...@gmail.com> on 2017/06/14 08:58:55 UTC

Re: About client session timed out, have not

I managed to track down the source of the error. The first error message is
in nimbus.log: "o.a.s.d.nimbus [INFO] Executor T1-1-1497424747:[8 8] not
alive". The nimbus detects that some executors are not alive and therefore
makes a reassignment, which causes the workers to restart.

I did not find any error in the supervisor and worker logs that would
cause an executor to time out. Since my topology is data-intensive, is it
possible that the heartbeat messages were not delivered in time and that
this caused the executor timeouts on the nimbus? How does an executor send
heartbeats to the nimbus? Is it through ZooKeeper, or is it transmitted via
Netty like a metric tuple?

I will first try to solve the problem by increasing the timeout values and
will let you know whether it works.
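
For reference, these are the storm.yaml settings I am planning to raise as a
first experiment. The values below are only an example, and the defaults in
the comments are what I see in defaults.yaml, so please correct me if I got
them wrong:

    # storm.yaml (cluster-wide settings)
    nimbus.task.timeout.secs: 60           # nimbus marks an executor "not alive" after this long without a heartbeat (default 30)
    supervisor.worker.timeout.secs: 60     # supervisor restarts a worker after this long without a heartbeat (default 30)
    storm.zookeeper.session.timeout: 30000 # ZooKeeper client session timeout in ms (default 20000)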

Regards,
Li Wang

On 14 June 2017 at 11:27, Li Wang <wa...@gmail.com> wrote:

> Hi all,
>
> We deployed a data-intensive topology that involves a lot of HDFS access
> via the HDFS client. We found that after the topology has been running
> for about half an hour, its throughput occasionally drops to zero for
> tens of seconds, and sometimes a worker is shut down without any error
> messages.
>
> I checked the logs thoroughly and found nothing wrong except an info
> message that reads “ClientCnxn [INFO] Client session timed out, have not
> heard from server in 13333ms for sessionid …”. I am not sure how this
> message is related to the weird behavior of my topology, but every time
> the topology behaves abnormally, this message shows up in the log.
>
> Any help or suggestion is highly appreciated.
>
> Thanks,
> Li Wang.
>
>
>

Re: About client session timed out, have not

Posted by Bobby Evans <ev...@yahoo-inc.com.INVALID>.
Turn on garbage collection logging on your workers.  We have found that GC is typically the culprit in cases like this: a large stop-the-world GC pause makes all threads wait for it, so the heartbeats time out and things get rescheduled.
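
Something like the following in storm.yaml should do it.  The flags assume a Java 8 JVM, and the heap size and log path are only examples, so adjust them for your setup:

    # storm.yaml on the supervisors
    worker.childopts: "-Xmx2048m -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -Xloggc:artifacts/gc.log"

You can also set topology.worker.childopts on just that topology if you do not want to change the whole cluster.
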
Once you know it is GC, you need to take a look at increasing your heap (probably a good first step) and then at monitoring what is on the heap and whether you have a memory leak.  This can sometimes be difficult because Storm has lots of buffers, so you may also want to turn on the logging metrics consumer so you can start to see whether some of the buffers in your topology are full and, if so, by how much.
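
The bundled consumer can be registered in storm.yaml (or in the topology conf) roughly like this; the class name is the one that ships with Storm 1.x, and the parallelism hint is just an example:

    topology.metrics.consumer.register:
      - class: "org.apache.storm.metric.LoggingMetricsConsumer"
        parallelism.hint: 1

The queue metrics it logs should tell you whether the send and receive buffers in your topology are filling up and by how much.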

- Bobby

