Posted to user@storm.apache.org by Renjie Liu <li...@gmail.com> on 2015/11/01 09:56:44 UTC

Strange storm problem

Hi, storm community:

We have a Storm cluster deployed with 15 workers, and recently we have often
experienced failures due to ack timeouts. Our input source is Kafka, and we use
Ganglia to monitor the cluster. The failures now occur roughly every 12 hours,
and the following are my observations from several monitoring tools when the
problem happens:

   1. The topology page shows that no worker went down, since the uptime of
   each task is nearly equal to the topology uptime.
   2. I've checked Ganglia; the CPU and memory reports do not give any clue
   about the problem, but the network report shows something unusual: the
   inbound rate decreases a little, while the outbound rate drops to nearly
   zero on some workers.
   3. I've logged in to one of the machines mentioned above and found that
   one of the survivor areas always remains 100% full.
   4. dstat shows that the context-switch count (csw) jumps to 4k+ every few
   seconds, while it stays around 400 under normal conditions.

Can anyone give us some hints about this problem?

Re: Strange storm problem

Posted by Santosh Pingale <pi...@gmail.com>.
Please share a Storm UI screenshot so that we can have a look at the topology
stats, and a visualization screenshot so that we can see the flow.


Re: Strange storm problem

Posted by Renjie Liu <li...@gmail.com>.
The output speed is measured from the output of dstat, which shows the worker
host's network traffic rate.



-- 
Renjie Liu
Department of Computer Science & Engineering
Shanghai JiaoTong University

Re: Strange storm problem

Posted by Nathan Leung <nc...@gmail.com>.
How are you measuring output speed?  Is it possible that you are
experiencing problems with HBase?


Re: Strange storm problem

Posted by Renjie Liu <li...@gmail.com>.
The result of jstat shows that it's not in a full GC cycle, but each minor GC
takes more than 1s. However, the frequency of minor GC is quite low; one
happens only every few seconds.
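
For reference, a sketch of how detailed GC logging could be enabled on the
worker JVMs so that each of those long minor-GC pauses shows up in a log file
rather than only in jstat samples. This assumes the pre-1.0 backtype.storm
API; the flags are standard HotSpot options and the log path is illustrative.

    import backtype.storm.Config;

    public class GcLoggingOpts {
        public static Config withGcLogging(Config conf) {
            // topology.worker.childopts is added to the worker JVM options,
            // so these flags apply to every worker of this topology.
            // The log path is only an example; %ID% is replaced by the
            // worker port in versions that substitute it, otherwise use a
            // fixed path.
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
                    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
                    + " -Xloggc:/var/log/storm/worker-gc-%ID%.log");
            return conf;
        }
    }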



-- 
Renjie Liu
Department of Computer Science & Engineering
Shanghai JiaoTong University

Re: Strange storm problem

Posted by Nathan Leung <nc...@gmail.com>.
The box with no throughput might be in a GC loop. Check your heap utilization
and increase the worker heap if necessary. Also consider decreasing max spout
pending; even without further details, 20k seems high.
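
As a reference for the "increase worker heap" part, a minimal sketch of how
the heap could be raised for one topology (backtype.storm API assumed; the 2g
figure is only an example and should be sized from observed heap usage):

    import backtype.storm.Config;

    public class WorkerHeapOpts {
        public static Config withLargerHeap(Config conf) {
            // topology.worker.childopts is appended after the supervisor's
            // worker.childopts, and the last -Xmx on the command line wins,
            // so this raises the heap for this topology's workers only.
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2g");
            return conf;
        }
    }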

Re: Strange storm problem

Posted by Renjie Liu <li...@gmail.com>.
Yes, we need to write results to HBase, but they are written out
asynchronously.
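
If the writes are asynchronous, one thing worth double-checking is where the
tuple is acked. A hedged sketch of one common pattern, where the ack happens
only after the write completes; writeToHBase is a hypothetical placeholder for
whatever client is actually used, and the collector is synchronized because
the pre-1.0 OutputCollector is not documented as thread-safe:

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hands the HBase write to a small thread pool and acks (or fails) the
    // tuple only once the write has finished, so slow writes show up as
    // latency instead of silent ack timeouts.
    public class AsyncHBaseBolt extends BaseRichBolt {
        private transient OutputCollector collector;
        private transient ExecutorService writers;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            this.writers = Executors.newFixedThreadPool(4); // illustrative size
        }

        @Override
        public void execute(final Tuple tuple) {
            writers.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        writeToHBase(tuple); // hypothetical helper
                        synchronized (collector) { collector.ack(tuple); }
                    } catch (Exception e) {
                        synchronized (collector) { collector.fail(tuple); }
                    }
                }
            });
        }

        private void writeToHBase(Tuple tuple) throws Exception {
            // placeholder for the real HBase put
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // this sketch emits nothing downstream
        }
    }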




-- 
Renjie Liu
Department of Computer Science & Engineering
Shanghai JiaoTong University

Re: Strange storm problem

Posted by Harsha <st...@harsha.io>.
Do you have any calls to external data sources which might be increasing
the latency and causing tuple timeout?
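
For reference, a sketch of how the tuple timeout could be raised while the
source of the latency is investigated (backtype.storm API assumed; 60 seconds
is only an illustrative value, and raising the timeout hides rather than fixes
slow external calls):

    import backtype.storm.Config;

    public class TimeoutOpts {
        public static Config withLongerTimeout(Config conf) {
            // topology.message.timeout.secs is how long a tuple tree may
            // stay un-acked before the spout replays it (default 30s).
            conf.setMessageTimeoutSecs(60); // illustrative value
            return conf;
        }
    }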



Re: Strange storm problem

Posted by Renjie Liu <li...@gmail.com>.
Yes, I've set it to 20000



-- 
Renjie Liu
Department of Computer Science & Engineering
Shanghai JiaoTong University

Re: Strange storm problem

Posted by Santosh Pingale <pi...@gmail.com>.
Have you set 'topology.max.spout.pending'?
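
For reference, a minimal sketch of where that setting lives in the topology
config (backtype.storm API assumed; 20000 is simply the value mentioned
elsewhere in this thread, not a recommendation):

    import backtype.storm.Config;

    public class SpoutPendingOpts {
        public static Config withMaxSpoutPending(Config conf) {
            // topology.max.spout.pending caps how many tuples emitted by
            // each spout task may be un-acked at once; it only has an
            // effect when acking is enabled.
            conf.setMaxSpoutPending(20000);
            return conf;
        }
    }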
