Posted to user@storm.apache.org by Fang Chen <fc...@gmail.com> on 2015/06/12 06:51:53 UTC

Has anybody successfully run storm 0.9+ in production under reasonable load?

We have been testing storm from 0.9.0.1 through 0.9.4 (I have not tried 0.9.5
yet, but I don't see any significant differences there), and unfortunately
we could not even get a clean run of more than 30 minutes on a cluster of 5
high-end nodes. ZooKeeper is also set up on these nodes, but on different
disks.

I have had huge trouble getting my data analytics topology to run stably, so
I tried the simplest topology I can think of: just an empty bolt, with no I/O
except for reading from a Kafka queue.

Just to report my latest testing on 0.9.4 with this empty bolt (Kafka topic
partitions=1, spout tasks=1, bolts=20 with fields grouping, msg size=1k):
after 26 minutes, nimbus orders the topology to be killed because it believes
the topology is dead, then another kill comes 2 minutes later, then another
after another 4 minutes, and on and on.
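
For reference, here is a minimal sketch of this kind of test topology,
assuming the storm-kafka 0.9.x API; the ZooKeeper address, topic name, and
the EmptyBolt class are illustrative placeholders rather than our actual
setup:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class EmptyBoltTopology {

    // A bolt that does nothing, to take user logic out of the picture.
    public static class EmptyBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            // no-op; BaseBasicBolt acks the tuple automatically
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // emits nothing
        }
    }

    public static void main(String[] args) throws Exception {
        // ZooKeeper address, topic, zkRoot, and consumer id are placeholders.
        SpoutConfig spoutConf =
                new SpoutConfig(new ZkHosts("zk1:2181"), "test-topic", "/kafka-spout", "empty-bolt-test");
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 1);   // 1 spout task
        builder.setBolt("empty-bolt", new EmptyBolt(), 20)               // 20 bolt executors
               .fieldsGrouping("kafka-spout", new Fields(StringScheme.STRING_SCHEME_KEY));

        Config conf = new Config();
        conf.setNumWorkers(5);
        StormSubmitter.submitTopology("empty-bolt-test", conf, builder.createTopology());
    }
}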

I can understand there might be issues in the coordination among nimbus,
workers and executors (e.g., heartbeats). But are there any workable
workarounds? I wish there were, since so many of you are using it in
production :-)
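
For reference, the heartbeat timeouts involved here
(nimbus.task.timeout.secs, supervisor.worker.timeout.secs,
storm.zookeeper.session.timeout) are cluster-side settings in storm.yaml, so
raising them is at best a stop-gap. A small sketch, assuming the storm jars
and conf directory are on the classpath, that prints the values a node is
actually running with:

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.utils.Utils;

public class PrintTimeouts {
    public static void main(String[] args) {
        // Reads defaults.yaml plus the local storm.yaml, the same way the daemons do.
        Map conf = Utils.readStormConfig();
        System.out.println(Config.NIMBUS_TASK_TIMEOUT_SECS + " = "
                + conf.get(Config.NIMBUS_TASK_TIMEOUT_SECS));          // nimbus's "not alive" threshold
        System.out.println(Config.SUPERVISOR_WORKER_TIMEOUT_SECS + " = "
                + conf.get(Config.SUPERVISOR_WORKER_TIMEOUT_SECS));    // supervisor's worker-kill threshold
        System.out.println(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT + " = "
                + conf.get(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT));   // ZK session timeout (ms)
    }
}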

I would deeply appreciate any suggestions that could make even my toy
topology work!

Fang

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Erik Weathers <ew...@groupon.com>.
Yes, the netty errors from a large set of worker deaths really obscure the
original root cause.  Again, you need to diagnose that root cause.

- Erik

On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:

> Forgot to add, one complication of this problem is that, after several
> rounds of killing, workers re-spawned can no longer talk to their peers,
> with all sorts of netty exceptions.
>
> On Thu, Jun 11, 2015 at 9:51 PM, Fang Chen <fc2004@gmail.com
> <javascript:_e(%7B%7D,'cvml','fc2004@gmail.com');>> wrote:
>
>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
>> 0.9.5 yet but I don't see any significant differences there), and
>> unfortunately we could not even have a clean run for over 30 minutes on a
>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>> different disks.
>>
>> I have huge troubles to give my data analytics topology a stable run. So
>> I tried the simplest topology I can think of, just an emtpy bolt, no io
>> except for reading from kafka queue.
>>
>> Just to report my latest testing on 0.9.4 with this empty bolt (kakfa
>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>> size=1k).
>> After 26 minutes, nimbus orders to kill the topology as it believe the
>> topology is dead, then after another 2 minutes, another kill, then another
>> after another 4 minutes, and on and on.
>>
>> I can understand there might be issues in the coordination among nimbus,
>> worker and executor (e.g., heartbeats). But are there any doable
>> workarounds? I wish there are as so many of you are using it in production
>> :-)
>>
>> I deeply appreciate any suggestions that could even make my toy topology
>> working!
>>
>> Fang
>>
>>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Fang Chen <fc...@gmail.com>.
Forgot to add: one complication of this problem is that, after several
rounds of killing, the re-spawned workers can no longer talk to their peers,
with all sorts of netty exceptions.

On Thu, Jun 11, 2015 at 9:51 PM, Fang Chen <fc...@gmail.com> wrote:

> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
> 0.9.5 yet but I don't see any significant differences there), and
> unfortunately we could not even have a clean run for over 30 minutes on a
> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
> different disks.
>
> I have huge troubles to give my data analytics topology a stable run. So I
> tried the simplest topology I can think of, just an emtpy bolt, no io
> except for reading from kafka queue.
>
> Just to report my latest testing on 0.9.4 with this empty bolt (kakfa
> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
> size=1k).
> After 26 minutes, nimbus orders to kill the topology as it believe the
> topology is dead, then after another 2 minutes, another kill, then another
> after another 4 minutes, and on and on.
>
> I can understand there might be issues in the coordination among nimbus,
> worker and executor (e.g., heartbeats). But are there any doable
> workarounds? I wish there are as so many of you are using it in production
> :-)
>
> I deeply appreciate any suggestions that could even make my toy topology
> working!
>
> Fang
>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Devang Shah <de...@gmail.com>.
What's your max spout pending value for the topology?

Also observe the CPU usage, i.e. how many cycles are being spent on the
process.
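
For context, max spout pending is a per-topology setting applied in the
submission code; a minimal sketch, with an illustrative value:

import backtype.storm.Config;

// In the topology submission code; 500 is only an illustrative value.
Config conf = new Config();
conf.setMaxSpoutPending(500);  // topology.max.spout.pending: caps un-acked tuples per spout task
// Note: it only takes effect when the spout emits tuples with message ids (acking enabled).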

Thanks and Regards,
Devang
On 19 Jun 2015 02:46, "Fang Chen" <fc...@gmail.com> wrote:

> tried. no effect.
>
> Thanks,
> Fang
>
> On Mon, Jun 15, 2015 at 9:33 PM, Binh Nguyen Van <bi...@gmail.com>
> wrote:
>
>> Can you try this.
>>
>> Remove -XX:+CMSScavengeBeforeRemark flag and reduce your heap size so
>> that YGC happen once every 2-3 seconds?
>> If that fix the issue then I think GC is the cause of your problem.
>>
>> On Mon, Jun 15, 2015 at 11:56 AM, Fang Chen <fc...@gmail.com> wrote:
>>
>>> We use storm bare bones, not trident as it's too expensive for our use
>>> cases.  The jvm options for supervisor is listed below but it might not be
>>> optimal in any sense.
>>>
>>> supervisor.childopts: "-Xms2G -Xmx2G -XX:NewSize=1G -XX:+UseParNewGC
>>> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=6
>>> -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
>>> -XX:+UseTLAB -XX:+UseCondCardMark -XX:CMSWaitDuration=5000
>>> -XX:+CMSScavengeBeforeRemark -XX:+UnlockDiagnosticVMOptions
>>> -XX:ParGCCardsPerStrideChunk=4096 -XX:+ExplicitGCInvokesConcurrent
>>> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
>>> -XX:PrintFLSStatistics=1 -XX:+PrintPromotionFailure
>>> -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC
>>> -XX:+PrintSafepointStatistics -Xloggc:/usr/local/storm/logs/gc.log
>>> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10
>>> -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote.port=9998
>>> -Dcom.sun.management.jmxremote.ssl=false
>>> -Dcom.sun.management.jmxremote.authenticate=false"
>>>
>>>
>>>
>>> On Mon, Jun 15, 2015 at 11:48 AM, Binh Nguyen Van <bi...@gmail.com>
>>> wrote:
>>>
>>>> Just to be sure, are you using Storm or Storm Trident?
>>>> Also can you share the current setting of your supervisor.child_opts?
>>>>
>>>> On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen <fc...@gmail.com> wrote:
>>>>
>>>>> I did enable gc for both worker and supervisor and found nothing
>>>>> abnormal (pause is minimal and frequency is normal too).  I tried max
>>>>> spound pending of both 1000 and 500.
>>>>>
>>>>> Fang
>>>>>
>>>>> On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <bi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Fang,
>>>>>>
>>>>>> Did you check your GC log? Do you see anything abnormal?
>>>>>> What is your current max spout pending setting?
>>>>>>
>>>>>> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <fc...@gmail.com> wrote:
>>>>>>
>>>>>>> I also did this and find no success.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Fang
>>>>>>>
>>>>>>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <nc...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> After I wrote that I realized you tried empty topology anyways.
>>>>>>>> This should reduce any gc or worker initialization related failures though
>>>>>>>> they are still possible.  As Erik mentioned check ZK.  Also I'm not sure if
>>>>>>>> this is still required but it used to be helpful to make sure your storm
>>>>>>>> nodes have each other listed in /etc/hosts.
>>>>>>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <nc...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Make sure your topology is starting up in the allotted time, and
>>>>>>>>> if not try increasing the startup timeout.
>>>>>>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <fc...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Erik
>>>>>>>>>>
>>>>>>>>>> Thanks for your reply!  It's great to hear about real production
>>>>>>>>>> usages. For our use case, we are really puzzled by the outcome so far. The
>>>>>>>>>> initial investigation seems to indicate that workers don't die by
>>>>>>>>>> themselves ( i actually tried killing the supervisor and the worker would
>>>>>>>>>> continue running beyond 30 minutes).
>>>>>>>>>>
>>>>>>>>>> The sequence of events is like this:  supervisor immediately
>>>>>>>>>> complains worker "still has not started" for a few seconds right after
>>>>>>>>>> launching the worker process, then silent --> after 26 minutes, nimbus
>>>>>>>>>> complains executors (related to the worker) "not alive" and started to
>>>>>>>>>> reassign topology --> after another ~500 milliseconds, the supervisor shuts
>>>>>>>>>> down its worker --> other peer workers complain about netty issues. and the
>>>>>>>>>> loop goes on.
>>>>>>>>>>
>>>>>>>>>> Could you kindly tell me what version of zookeeper is used with
>>>>>>>>>> 0.9.4? and how many nodes in the zookeeper cluster?
>>>>>>>>>>
>>>>>>>>>> I wonder if this is due to zookeeper issues.
>>>>>>>>>>
>>>>>>>>>> Thanks a lot,
>>>>>>>>>> Fang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <
>>>>>>>>>> eweathers@groupon.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Fang,
>>>>>>>>>>>
>>>>>>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of
>>>>>>>>>>> 30+ nodes.
>>>>>>>>>>>
>>>>>>>>>>> One of the challenges with storm is figuring out what the root
>>>>>>>>>>> cause is when things go haywire.  You'll wanna examine why the nimbus
>>>>>>>>>>> decided to restart your worker processes.  It would happen when workers die
>>>>>>>>>>> and the nimbus notices that storm executors aren't alive.  (There are logs
>>>>>>>>>>> in nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>>>>>>>>> looking at logs on the worker hosts.
>>>>>>>>>>>
>>>>>>>>>>> - Erik
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not
>>>>>>>>>>>> tried 0.9.5 yet but I don't see any significant differences there), and
>>>>>>>>>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>>>>>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>>>>>>>>>> different disks.
>>>>>>>>>>>>
>>>>>>>>>>>> I have huge troubles to give my data analytics topology a
>>>>>>>>>>>> stable run. So I tried the simplest topology I can think of, just an emtpy
>>>>>>>>>>>> bolt, no io except for reading from kafka queue.
>>>>>>>>>>>>
>>>>>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>>>>>> (kakfa topic partition=1, spout task #=1, bolt #=20 with field grouping,
>>>>>>>>>>>> msg size=1k).
>>>>>>>>>>>> After 26 minutes, nimbus orders to kill the topology as it
>>>>>>>>>>>> believe the topology is dead, then after another 2 minutes, another kill,
>>>>>>>>>>>> then another after another 4 minutes, and on and on.
>>>>>>>>>>>>
>>>>>>>>>>>> I can understand there might be issues in the coordination
>>>>>>>>>>>> among nimbus, worker and executor (e.g., heartbeats). But are there any
>>>>>>>>>>>> doable workarounds? I wish there are as so many of you are using it in
>>>>>>>>>>>> production :-)
>>>>>>>>>>>>
>>>>>>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>>>>>>> topology working!
>>>>>>>>>>>>
>>>>>>>>>>>> Fang
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Fang Chen <fc...@gmail.com>.
tried. no effect.

Thanks,
Fang

On Mon, Jun 15, 2015 at 9:33 PM, Binh Nguyen Van <bi...@gmail.com> wrote:

> Can you try this.
>
> Remove -XX:+CMSScavengeBeforeRemark flag and reduce your heap size so
> that YGC happen once every 2-3 seconds?
> If that fix the issue then I think GC is the cause of your problem.
>
> On Mon, Jun 15, 2015 at 11:56 AM, Fang Chen <fc...@gmail.com> wrote:
>
>> We use storm bare bones, not trident as it's too expensive for our use
>> cases.  The jvm options for supervisor is listed below but it might not be
>> optimal in any sense.
>>
>> supervisor.childopts: "-Xms2G -Xmx2G -XX:NewSize=1G -XX:+UseParNewGC
>> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=6
>> -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:+UseTLAB -XX:+UseCondCardMark -XX:CMSWaitDuration=5000
>> -XX:+CMSScavengeBeforeRemark -XX:+UnlockDiagnosticVMOptions
>> -XX:ParGCCardsPerStrideChunk=4096 -XX:+ExplicitGCInvokesConcurrent
>> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
>> -XX:PrintFLSStatistics=1 -XX:+PrintPromotionFailure
>> -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC
>> -XX:+PrintSafepointStatistics -Xloggc:/usr/local/storm/logs/gc.log
>> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10
>> -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote.port=9998
>> -Dcom.sun.management.jmxremote.ssl=false
>> -Dcom.sun.management.jmxremote.authenticate=false"
>>
>>
>>
>> On Mon, Jun 15, 2015 at 11:48 AM, Binh Nguyen Van <bi...@gmail.com>
>> wrote:
>>
>>> Just to be sure, are you using Storm or Storm Trident?
>>> Also can you share the current setting of your supervisor.child_opts?
>>>
>>> On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen <fc...@gmail.com> wrote:
>>>
>>>> I did enable gc for both worker and supervisor and found nothing
>>>> abnormal (pause is minimal and frequency is normal too).  I tried max
>>>> spound pending of both 1000 and 500.
>>>>
>>>> Fang
>>>>
>>>> On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <bi...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Fang,
>>>>>
>>>>> Did you check your GC log? Do you see anything abnormal?
>>>>> What is your current max spout pending setting?
>>>>>
>>>>> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <fc...@gmail.com> wrote:
>>>>>
>>>>>> I also did this and find no success.
>>>>>>
>>>>>> Thanks,
>>>>>> Fang
>>>>>>
>>>>>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <nc...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> After I wrote that I realized you tried empty topology anyways.
>>>>>>> This should reduce any gc or worker initialization related failures though
>>>>>>> they are still possible.  As Erik mentioned check ZK.  Also I'm not sure if
>>>>>>> this is still required but it used to be helpful to make sure your storm
>>>>>>> nodes have each other listed in /etc/hosts.
>>>>>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <nc...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Make sure your topology is starting up in the allotted time, and if
>>>>>>>> not try increasing the startup timeout.
>>>>>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <fc...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Erik
>>>>>>>>>
>>>>>>>>> Thanks for your reply!  It's great to hear about real production
>>>>>>>>> usages. For our use case, we are really puzzled by the outcome so far. The
>>>>>>>>> initial investigation seems to indicate that workers don't die by
>>>>>>>>> themselves ( i actually tried killing the supervisor and the worker would
>>>>>>>>> continue running beyond 30 minutes).
>>>>>>>>>
>>>>>>>>> The sequence of events is like this:  supervisor immediately
>>>>>>>>> complains worker "still has not started" for a few seconds right after
>>>>>>>>> launching the worker process, then silent --> after 26 minutes, nimbus
>>>>>>>>> complains executors (related to the worker) "not alive" and started to
>>>>>>>>> reassign topology --> after another ~500 milliseconds, the supervisor shuts
>>>>>>>>> down its worker --> other peer workers complain about netty issues. and the
>>>>>>>>> loop goes on.
>>>>>>>>>
>>>>>>>>> Could you kindly tell me what version of zookeeper is used with
>>>>>>>>> 0.9.4? and how many nodes in the zookeeper cluster?
>>>>>>>>>
>>>>>>>>> I wonder if this is due to zookeeper issues.
>>>>>>>>>
>>>>>>>>> Thanks a lot,
>>>>>>>>> Fang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <
>>>>>>>>> eweathers@groupon.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Fang,
>>>>>>>>>>
>>>>>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of
>>>>>>>>>> 30+ nodes.
>>>>>>>>>>
>>>>>>>>>> One of the challenges with storm is figuring out what the root
>>>>>>>>>> cause is when things go haywire.  You'll wanna examine why the nimbus
>>>>>>>>>> decided to restart your worker processes.  It would happen when workers die
>>>>>>>>>> and the nimbus notices that storm executors aren't alive.  (There are logs
>>>>>>>>>> in nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>>>>>>>> looking at logs on the worker hosts.
>>>>>>>>>>
>>>>>>>>>> - Erik
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not
>>>>>>>>>>> tried 0.9.5 yet but I don't see any significant differences there), and
>>>>>>>>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>>>>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>>>>>>>>> different disks.
>>>>>>>>>>>
>>>>>>>>>>> I have huge troubles to give my data analytics topology a stable
>>>>>>>>>>> run. So I tried the simplest topology I can think of, just an emtpy bolt,
>>>>>>>>>>> no io except for reading from kafka queue.
>>>>>>>>>>>
>>>>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>>>>> (kakfa topic partition=1, spout task #=1, bolt #=20 with field grouping,
>>>>>>>>>>> msg size=1k).
>>>>>>>>>>> After 26 minutes, nimbus orders to kill the topology as it
>>>>>>>>>>> believe the topology is dead, then after another 2 minutes, another kill,
>>>>>>>>>>> then another after another 4 minutes, and on and on.
>>>>>>>>>>>
>>>>>>>>>>> I can understand there might be issues in the coordination among
>>>>>>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>>>>>>>>> workarounds? I wish there are as so many of you are using it in production
>>>>>>>>>>> :-)
>>>>>>>>>>>
>>>>>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>>>>>> topology working!
>>>>>>>>>>>
>>>>>>>>>>> Fang
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Binh Nguyen Van <bi...@gmail.com>.
Can you try this:

Remove the -XX:+CMSScavengeBeforeRemark flag and reduce your heap size so
that young GCs happen once every 2-3 seconds.
If that fixes the issue, then I think GC is the cause of your problem.

On Mon, Jun 15, 2015 at 11:56 AM, Fang Chen <fc...@gmail.com> wrote:

> We use storm bare bones, not trident as it's too expensive for our use
> cases.  The jvm options for supervisor is listed below but it might not be
> optimal in any sense.
>
> supervisor.childopts: "-Xms2G -Xmx2G -XX:NewSize=1G -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=6
> -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+UseTLAB -XX:+UseCondCardMark -XX:CMSWaitDuration=5000
> -XX:+CMSScavengeBeforeRemark -XX:+UnlockDiagnosticVMOptions
> -XX:ParGCCardsPerStrideChunk=4096 -XX:+ExplicitGCInvokesConcurrent
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
> -XX:PrintFLSStatistics=1 -XX:+PrintPromotionFailure
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC
> -XX:+PrintSafepointStatistics -Xloggc:/usr/local/storm/logs/gc.log
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10
> -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote.port=9998
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false"
>
>
>
> On Mon, Jun 15, 2015 at 11:48 AM, Binh Nguyen Van <bi...@gmail.com>
> wrote:
>
>> Just to be sure, are you using Storm or Storm Trident?
>> Also can you share the current setting of your supervisor.child_opts?
>>
>> On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen <fc...@gmail.com> wrote:
>>
>>> I did enable gc for both worker and supervisor and found nothing
>>> abnormal (pause is minimal and frequency is normal too).  I tried max
>>> spound pending of both 1000 and 500.
>>>
>>> Fang
>>>
>>> On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <bi...@gmail.com>
>>> wrote:
>>>
>>>> Hi Fang,
>>>>
>>>> Did you check your GC log? Do you see anything abnormal?
>>>> What is your current max spout pending setting?
>>>>
>>>> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <fc...@gmail.com> wrote:
>>>>
>>>>> I also did this and find no success.
>>>>>
>>>>> Thanks,
>>>>> Fang
>>>>>
>>>>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <nc...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> After I wrote that I realized you tried empty topology anyways.  This
>>>>>> should reduce any gc or worker initialization related failures though they
>>>>>> are still possible.  As Erik mentioned check ZK.  Also I'm not sure if this
>>>>>> is still required but it used to be helpful to make sure your storm nodes
>>>>>> have each other listed in /etc/hosts.
>>>>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <nc...@gmail.com> wrote:
>>>>>>
>>>>>>> Make sure your topology is starting up in the allotted time, and if
>>>>>>> not try increasing the startup timeout.
>>>>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <fc...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Erik
>>>>>>>>
>>>>>>>> Thanks for your reply!  It's great to hear about real production
>>>>>>>> usages. For our use case, we are really puzzled by the outcome so far. The
>>>>>>>> initial investigation seems to indicate that workers don't die by
>>>>>>>> themselves ( i actually tried killing the supervisor and the worker would
>>>>>>>> continue running beyond 30 minutes).
>>>>>>>>
>>>>>>>> The sequence of events is like this:  supervisor immediately
>>>>>>>> complains worker "still has not started" for a few seconds right after
>>>>>>>> launching the worker process, then silent --> after 26 minutes, nimbus
>>>>>>>> complains executors (related to the worker) "not alive" and started to
>>>>>>>> reassign topology --> after another ~500 milliseconds, the supervisor shuts
>>>>>>>> down its worker --> other peer workers complain about netty issues. and the
>>>>>>>> loop goes on.
>>>>>>>>
>>>>>>>> Could you kindly tell me what version of zookeeper is used with
>>>>>>>> 0.9.4? and how many nodes in the zookeeper cluster?
>>>>>>>>
>>>>>>>> I wonder if this is due to zookeeper issues.
>>>>>>>>
>>>>>>>> Thanks a lot,
>>>>>>>> Fang
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <
>>>>>>>> eweathers@groupon.com> wrote:
>>>>>>>>
>>>>>>>>> Hey Fang,
>>>>>>>>>
>>>>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of
>>>>>>>>> 30+ nodes.
>>>>>>>>>
>>>>>>>>> One of the challenges with storm is figuring out what the root
>>>>>>>>> cause is when things go haywire.  You'll wanna examine why the nimbus
>>>>>>>>> decided to restart your worker processes.  It would happen when workers die
>>>>>>>>> and the nimbus notices that storm executors aren't alive.  (There are logs
>>>>>>>>> in nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>>>>>>> looking at logs on the worker hosts.
>>>>>>>>>
>>>>>>>>> - Erik
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not
>>>>>>>>>> tried 0.9.5 yet but I don't see any significant differences there), and
>>>>>>>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>>>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>>>>>>>> different disks.
>>>>>>>>>>
>>>>>>>>>> I have huge troubles to give my data analytics topology a stable
>>>>>>>>>> run. So I tried the simplest topology I can think of, just an emtpy bolt,
>>>>>>>>>> no io except for reading from kafka queue.
>>>>>>>>>>
>>>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>>>> (kakfa topic partition=1, spout task #=1, bolt #=20 with field grouping,
>>>>>>>>>> msg size=1k).
>>>>>>>>>> After 26 minutes, nimbus orders to kill the topology as it
>>>>>>>>>> believe the topology is dead, then after another 2 minutes, another kill,
>>>>>>>>>> then another after another 4 minutes, and on and on.
>>>>>>>>>>
>>>>>>>>>> I can understand there might be issues in the coordination among
>>>>>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>>>>>>>> workarounds? I wish there are as so many of you are using it in production
>>>>>>>>>> :-)
>>>>>>>>>>
>>>>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>>>>> topology working!
>>>>>>>>>>
>>>>>>>>>> Fang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Fang Chen <fc...@gmail.com>.
We use bare-bones Storm, not Trident, as Trident is too expensive for our use
cases.  The JVM options for the supervisor are listed below, but they might
not be optimal in any sense.

supervisor.childopts: "-Xms2G -Xmx2G -XX:NewSize=1G -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=6
-XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseTLAB -XX:+UseCondCardMark -XX:CMSWaitDuration=5000
-XX:+CMSScavengeBeforeRemark -XX:+UnlockDiagnosticVMOptions
-XX:ParGCCardsPerStrideChunk=4096 -XX:+ExplicitGCInvokesConcurrent
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
-XX:PrintFLSStatistics=1 -XX:+PrintPromotionFailure
-XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC
-XX:+PrintSafepointStatistics -Xloggc:/usr/local/storm/logs/gc.log
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote.port=9998
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false"
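
For completeness: supervisor.childopts only applies to the supervisor
daemon's JVM; the worker JVMs take their options from worker.childopts in
storm.yaml, which can also be extended per topology. A minimal sketch of the
per-topology form (the flags shown are an illustration, not a
recommendation):

import backtype.storm.Config;

// Appended to worker.childopts for this topology's worker JVMs only.
Config conf = new Config();
conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
        "-Xms1G -Xmx1G -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps");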



On Mon, Jun 15, 2015 at 11:48 AM, Binh Nguyen Van <bi...@gmail.com>
wrote:

> Just to be sure, are you using Storm or Storm Trident?
> Also can you share the current setting of your supervisor.child_opts?
>
> On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen <fc...@gmail.com> wrote:
>
>> I did enable gc for both worker and supervisor and found nothing abnormal
>> (pause is minimal and frequency is normal too).  I tried max spound pending
>> of both 1000 and 500.
>>
>> Fang
>>
>> On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <bi...@gmail.com>
>> wrote:
>>
>>> Hi Fang,
>>>
>>> Did you check your GC log? Do you see anything abnormal?
>>> What is your current max spout pending setting?
>>>
>>> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <fc...@gmail.com> wrote:
>>>
>>>> I also did this and find no success.
>>>>
>>>> Thanks,
>>>> Fang
>>>>
>>>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <nc...@gmail.com>
>>>> wrote:
>>>>
>>>>> After I wrote that I realized you tried empty topology anyways.  This
>>>>> should reduce any gc or worker initialization related failures though they
>>>>> are still possible.  As Erik mentioned check ZK.  Also I'm not sure if this
>>>>> is still required but it used to be helpful to make sure your storm nodes
>>>>> have each other listed in /etc/hosts.
>>>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <nc...@gmail.com> wrote:
>>>>>
>>>>>> Make sure your topology is starting up in the allotted time, and if
>>>>>> not try increasing the startup timeout.
>>>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <fc...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Erik
>>>>>>>
>>>>>>> Thanks for your reply!  It's great to hear about real production
>>>>>>> usages. For our use case, we are really puzzled by the outcome so far. The
>>>>>>> initial investigation seems to indicate that workers don't die by
>>>>>>> themselves ( i actually tried killing the supervisor and the worker would
>>>>>>> continue running beyond 30 minutes).
>>>>>>>
>>>>>>> The sequence of events is like this:  supervisor immediately
>>>>>>> complains worker "still has not started" for a few seconds right after
>>>>>>> launching the worker process, then silent --> after 26 minutes, nimbus
>>>>>>> complains executors (related to the worker) "not alive" and started to
>>>>>>> reassign topology --> after another ~500 milliseconds, the supervisor shuts
>>>>>>> down its worker --> other peer workers complain about netty issues. and the
>>>>>>> loop goes on.
>>>>>>>
>>>>>>> Could you kindly tell me what version of zookeeper is used with
>>>>>>> 0.9.4? and how many nodes in the zookeeper cluster?
>>>>>>>
>>>>>>> I wonder if this is due to zookeeper issues.
>>>>>>>
>>>>>>> Thanks a lot,
>>>>>>> Fang
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <
>>>>>>> eweathers@groupon.com> wrote:
>>>>>>>
>>>>>>>> Hey Fang,
>>>>>>>>
>>>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of 30+
>>>>>>>> nodes.
>>>>>>>>
>>>>>>>> One of the challenges with storm is figuring out what the root
>>>>>>>> cause is when things go haywire.  You'll wanna examine why the nimbus
>>>>>>>> decided to restart your worker processes.  It would happen when workers die
>>>>>>>> and the nimbus notices that storm executors aren't alive.  (There are logs
>>>>>>>> in nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>>>>>> looking at logs on the worker hosts.
>>>>>>>>
>>>>>>>> - Erik
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not
>>>>>>>>> tried 0.9.5 yet but I don't see any significant differences there), and
>>>>>>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>>>>>>> different disks.
>>>>>>>>>
>>>>>>>>> I have huge troubles to give my data analytics topology a stable
>>>>>>>>> run. So I tried the simplest topology I can think of, just an emtpy bolt,
>>>>>>>>> no io except for reading from kafka queue.
>>>>>>>>>
>>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>>> (kakfa topic partition=1, spout task #=1, bolt #=20 with field grouping,
>>>>>>>>> msg size=1k).
>>>>>>>>> After 26 minutes, nimbus orders to kill the topology as it believe
>>>>>>>>> the topology is dead, then after another 2 minutes, another kill, then
>>>>>>>>> another after another 4 minutes, and on and on.
>>>>>>>>>
>>>>>>>>> I can understand there might be issues in the coordination among
>>>>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>>>>>>> workarounds? I wish there are as so many of you are using it in production
>>>>>>>>> :-)
>>>>>>>>>
>>>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>>>> topology working!
>>>>>>>>>
>>>>>>>>> Fang
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>
>>>
>>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Binh Nguyen Van <bi...@gmail.com>.
Just to be sure, are you using Storm or Storm Trident?
Also can you share the current setting of your supervisor.child_opts?

On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen <fc...@gmail.com> wrote:

> I did enable gc for both worker and supervisor and found nothing abnormal
> (pause is minimal and frequency is normal too).  I tried max spound pending
> of both 1000 and 500.
>
> Fang
>
> On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <bi...@gmail.com>
> wrote:
>
>> Hi Fang,
>>
>> Did you check your GC log? Do you see anything abnormal?
>> What is your current max spout pending setting?
>>
>> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <fc...@gmail.com> wrote:
>>
>>> I also did this and find no success.
>>>
>>> Thanks,
>>> Fang
>>>
>>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <nc...@gmail.com> wrote:
>>>
>>>> After I wrote that I realized you tried empty topology anyways.  This
>>>> should reduce any gc or worker initialization related failures though they
>>>> are still possible.  As Erik mentioned check ZK.  Also I'm not sure if this
>>>> is still required but it used to be helpful to make sure your storm nodes
>>>> have each other listed in /etc/hosts.
>>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <nc...@gmail.com> wrote:
>>>>
>>>>> Make sure your topology is starting up in the allotted time, and if
>>>>> not try increasing the startup timeout.
>>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <fc...@gmail.com> wrote:
>>>>>
>>>>>> Hi Erik
>>>>>>
>>>>>> Thanks for your reply!  It's great to hear about real production
>>>>>> usages. For our use case, we are really puzzled by the outcome so far. The
>>>>>> initial investigation seems to indicate that workers don't die by
>>>>>> themselves ( i actually tried killing the supervisor and the worker would
>>>>>> continue running beyond 30 minutes).
>>>>>>
>>>>>> The sequence of events is like this:  supervisor immediately
>>>>>> complains worker "still has not started" for a few seconds right after
>>>>>> launching the worker process, then silent --> after 26 minutes, nimbus
>>>>>> complains executors (related to the worker) "not alive" and started to
>>>>>> reassign topology --> after another ~500 milliseconds, the supervisor shuts
>>>>>> down its worker --> other peer workers complain about netty issues. and the
>>>>>> loop goes on.
>>>>>>
>>>>>> Could you kindly tell me what version of zookeeper is used with
>>>>>> 0.9.4? and how many nodes in the zookeeper cluster?
>>>>>>
>>>>>> I wonder if this is due to zookeeper issues.
>>>>>>
>>>>>> Thanks a lot,
>>>>>> Fang
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <
>>>>>> eweathers@groupon.com> wrote:
>>>>>>
>>>>>>> Hey Fang,
>>>>>>>
>>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of 30+
>>>>>>> nodes.
>>>>>>>
>>>>>>> One of the challenges with storm is figuring out what the root cause
>>>>>>> is when things go haywire.  You'll wanna examine why the nimbus decided to
>>>>>>> restart your worker processes.  It would happen when workers die and the
>>>>>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>>>>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>>>>> looking at logs on the worker hosts.
>>>>>>>
>>>>>>> - Erik
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>>>>
>>>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not
>>>>>>>> tried 0.9.5 yet but I don't see any significant differences there), and
>>>>>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>>>>>> different disks.
>>>>>>>>
>>>>>>>> I have huge troubles to give my data analytics topology a stable
>>>>>>>> run. So I tried the simplest topology I can think of, just an emtpy bolt,
>>>>>>>> no io except for reading from kafka queue.
>>>>>>>>
>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>> (kakfa topic partition=1, spout task #=1, bolt #=20 with field grouping,
>>>>>>>> msg size=1k).
>>>>>>>> After 26 minutes, nimbus orders to kill the topology as it believe
>>>>>>>> the topology is dead, then after another 2 minutes, another kill, then
>>>>>>>> another after another 4 minutes, and on and on.
>>>>>>>>
>>>>>>>> I can understand there might be issues in the coordination among
>>>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>>>>>> workarounds? I wish there are as so many of you are using it in production
>>>>>>>> :-)
>>>>>>>>
>>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>>> topology working!
>>>>>>>>
>>>>>>>> Fang
>>>>>>>>
>>>>>>>>
>>>>>>
>>>
>>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Fang Chen <fc...@gmail.com>.
I did enable GC logging for both the worker and the supervisor and found
nothing abnormal (pauses are minimal and the frequency is normal too).  I
tried max spout pending values of both 1000 and 500.

Fang

On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <bi...@gmail.com>
wrote:

> Hi Fang,
>
> Did you check your GC log? Do you see anything abnormal?
> What is your current max spout pending setting?
>
> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <fc...@gmail.com> wrote:
>
>> I also did this and find no success.
>>
>> Thanks,
>> Fang
>>
>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <nc...@gmail.com> wrote:
>>
>>> After I wrote that I realized you tried empty topology anyways.  This
>>> should reduce any gc or worker initialization related failures though they
>>> are still possible.  As Erik mentioned check ZK.  Also I'm not sure if this
>>> is still required but it used to be helpful to make sure your storm nodes
>>> have each other listed in /etc/hosts.
>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <nc...@gmail.com> wrote:
>>>
>>>> Make sure your topology is starting up in the allotted time, and if not
>>>> try increasing the startup timeout.
>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <fc...@gmail.com> wrote:
>>>>
>>>>> Hi Erik
>>>>>
>>>>> Thanks for your reply!  It's great to hear about real production
>>>>> usages. For our use case, we are really puzzled by the outcome so far. The
>>>>> initial investigation seems to indicate that workers don't die by
>>>>> themselves ( i actually tried killing the supervisor and the worker would
>>>>> continue running beyond 30 minutes).
>>>>>
>>>>> The sequence of events is like this:  supervisor immediately complains
>>>>> worker "still has not started" for a few seconds right after launching the
>>>>> worker process, then silent --> after 26 minutes, nimbus complains
>>>>> executors (related to the worker) "not alive" and started to reassign
>>>>> topology --> after another ~500 milliseconds, the supervisor shuts down its
>>>>> worker --> other peer workers complain about netty issues. and the loop
>>>>> goes on.
>>>>>
>>>>> Could you kindly tell me what version of zookeeper is used with 0.9.4?
>>>>> and how many nodes in the zookeeper cluster?
>>>>>
>>>>> I wonder if this is due to zookeeper issues.
>>>>>
>>>>> Thanks a lot,
>>>>> Fang
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <eweathers@groupon.com
>>>>> > wrote:
>>>>>
>>>>>> Hey Fang,
>>>>>>
>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of 30+
>>>>>> nodes.
>>>>>>
>>>>>> One of the challenges with storm is figuring out what the root cause
>>>>>> is when things go haywire.  You'll wanna examine why the nimbus decided to
>>>>>> restart your worker processes.  It would happen when workers die and the
>>>>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>>>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>>>> looking at logs on the worker hosts.
>>>>>>
>>>>>> - Erik
>>>>>>
>>>>>>
>>>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>>>
>>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not
>>>>>>> tried 0.9.5 yet but I don't see any significant differences there), and
>>>>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>>>>> different disks.
>>>>>>>
>>>>>>> I have huge troubles to give my data analytics topology a stable
>>>>>>> run. So I tried the simplest topology I can think of, just an emtpy bolt,
>>>>>>> no io except for reading from kafka queue.
>>>>>>>
>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>> (kakfa topic partition=1, spout task #=1, bolt #=20 with field grouping,
>>>>>>> msg size=1k).
>>>>>>> After 26 minutes, nimbus orders to kill the topology as it believe
>>>>>>> the topology is dead, then after another 2 minutes, another kill, then
>>>>>>> another after another 4 minutes, and on and on.
>>>>>>>
>>>>>>> I can understand there might be issues in the coordination among
>>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>>>>> workarounds? I wish there are as so many of you are using it in production
>>>>>>> :-)
>>>>>>>
>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>> topology working!
>>>>>>>
>>>>>>> Fang
>>>>>>>
>>>>>>>
>>>>>
>>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Binh Nguyen Van <bi...@gmail.com>.
Hi Fang,

Did you check your GC log? Do you see anything abnormal?
What is your current max spout pending setting?

On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <fc...@gmail.com> wrote:

> I also did this and find no success.
>
> Thanks,
> Fang
>
> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <nc...@gmail.com> wrote:
>
>> After I wrote that I realized you tried empty topology anyways.  This
>> should reduce any gc or worker initialization related failures though they
>> are still possible.  As Erik mentioned check ZK.  Also I'm not sure if this
>> is still required but it used to be helpful to make sure your storm nodes
>> have each other listed in /etc/hosts.
>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <nc...@gmail.com> wrote:
>>
>>> Make sure your topology is starting up in the allotted time, and if not
>>> try increasing the startup timeout.
>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <fc...@gmail.com> wrote:
>>>
>>>> Hi Erik
>>>>
>>>> Thanks for your reply!  It's great to hear about real production
>>>> usages. For our use case, we are really puzzled by the outcome so far. The
>>>> initial investigation seems to indicate that workers don't die by
>>>> themselves ( i actually tried killing the supervisor and the worker would
>>>> continue running beyond 30 minutes).
>>>>
>>>> The sequence of events is like this:  supervisor immediately complains
>>>> worker "still has not started" for a few seconds right after launching the
>>>> worker process, then silent --> after 26 minutes, nimbus complains
>>>> executors (related to the worker) "not alive" and started to reassign
>>>> topology --> after another ~500 milliseconds, the supervisor shuts down its
>>>> worker --> other peer workers complain about netty issues. and the loop
>>>> goes on.
>>>>
>>>> Could you kindly tell me what version of zookeeper is used with 0.9.4?
>>>> and how many nodes in the zookeeper cluster?
>>>>
>>>> I wonder if this is due to zookeeper issues.
>>>>
>>>> Thanks a lot,
>>>> Fang
>>>>
>>>>
>>>>
>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <ew...@groupon.com>
>>>> wrote:
>>>>
>>>>> Hey Fang,
>>>>>
>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>>>>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>>>>
>>>>> One of the challenges with storm is figuring out what the root cause
>>>>> is when things go haywire.  You'll wanna examine why the nimbus decided to
>>>>> restart your worker processes.  It would happen when workers die and the
>>>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>>> looking at logs on the worker hosts.
>>>>>
>>>>> - Erik
>>>>>
>>>>>
>>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>>
>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
>>>>>> 0.9.5 yet but I don't see any significant differences there), and
>>>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>>>> different disks.
>>>>>>
>>>>>> I have huge troubles to give my data analytics topology a stable run.
>>>>>> So I tried the simplest topology I can think of, just an emtpy bolt, no io
>>>>>> except for reading from kafka queue.
>>>>>>
>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt (kakfa
>>>>>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>>>>>> size=1k).
>>>>>> After 26 minutes, nimbus orders to kill the topology as it believe
>>>>>> the topology is dead, then after another 2 minutes, another kill, then
>>>>>> another after another 4 minutes, and on and on.
>>>>>>
>>>>>> I can understand there might be issues in the coordination among
>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>>>> workarounds? I wish there are as so many of you are using it in production
>>>>>> :-)
>>>>>>
>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>> topology working!
>>>>>>
>>>>>> Fang
>>>>>>
>>>>>>
>>>>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Fang Chen <fc...@gmail.com>.
I also did this and had no success.

Thanks,
Fang

On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <nc...@gmail.com> wrote:

> After I wrote that I realized you tried empty topology anyways.  This
> should reduce any gc or worker initialization related failures though they
> are still possible.  As Erik mentioned check ZK.  Also I'm not sure if this
> is still required but it used to be helpful to make sure your storm nodes
> have each other listed in /etc/hosts.
> On Jun 12, 2015 8:59 AM, "Nathan Leung" <nc...@gmail.com> wrote:
>
>> Make sure your topology is starting up in the allotted time, and if not
>> try increasing the startup timeout.
>> On Jun 12, 2015 2:46 AM, "Fang Chen" <fc...@gmail.com> wrote:
>>
>>> Hi Erik
>>>
>>> Thanks for your reply!  It's great to hear about real production usages.
>>> For our use case, we are really puzzled by the outcome so far. The initial
>>> investigation seems to indicate that workers don't die by themselves ( i
>>> actually tried killing the supervisor and the worker would continue running
>>> beyond 30 minutes).
>>>
>>> The sequence of events is like this:  supervisor immediately complains
>>> worker "still has not started" for a few seconds right after launching the
>>> worker process, then silent --> after 26 minutes, nimbus complains
>>> executors (related to the worker) "not alive" and started to reassign
>>> topology --> after another ~500 milliseconds, the supervisor shuts down its
>>> worker --> other peer workers complain about netty issues. and the loop
>>> goes on.
>>>
>>> Could you kindly tell me what version of zookeeper is used with 0.9.4?
>>> and how many nodes in the zookeeper cluster?
>>>
>>> I wonder if this is due to zookeeper issues.
>>>
>>> Thanks a lot,
>>> Fang
>>>
>>>
>>>
>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <ew...@groupon.com>
>>> wrote:
>>>
>>>> Hey Fang,
>>>>
>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>>>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>>>
>>>> One of the challenges with storm is figuring out what the root cause is
>>>> when things go haywire.  You'll wanna examine why the nimbus decided to
>>>> restart your worker processes.  It would happen when workers die and the
>>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>> looking at logs on the worker hosts.
>>>>
>>>> - Erik
>>>>
>>>>
>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>
>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
>>>>> 0.9.5 yet but I don't see any significant differences there), and
>>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>>> different disks.
>>>>>
>>>>> I have huge troubles to give my data analytics topology a stable run.
>>>>> So I tried the simplest topology I can think of, just an emtpy bolt, no io
>>>>> except for reading from kafka queue.
>>>>>
>>>>> Just to report my latest testing on 0.9.4 with this empty bolt (kakfa
>>>>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>>>>> size=1k).
>>>>> After 26 minutes, nimbus orders to kill the topology as it believe the
>>>>> topology is dead, then after another 2 minutes, another kill, then another
>>>>> after another 4 minutes, and on and on.
>>>>>
>>>>> I can understand there might be issues in the coordination among
>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>>> workarounds? I wish there are as so many of you are using it in production
>>>>> :-)
>>>>>
>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>> topology working!
>>>>>
>>>>> Fang
>>>>>
>>>>>
>>>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Fang Chen <fc...@gmail.com>.
Thank you Nathan!

I will try a setup with /etc/hosts and see if that makes any difference.

Thanks,
Fang

On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <nc...@gmail.com> wrote:

> After I wrote that I realized you tried empty topology anyways.  This
> should reduce any gc or worker initialization related failures though they
> are still possible.  As Erik mentioned check ZK.  Also I'm not sure if this
> is still required but it used to be helpful to make sure your storm nodes
> have each other listed in /etc/hosts.
> On Jun 12, 2015 8:59 AM, "Nathan Leung" <nc...@gmail.com> wrote:
>
>> Make sure your topology is starting up in the allotted time, and if not
>> try increasing the startup timeout.
>> On Jun 12, 2015 2:46 AM, "Fang Chen" <fc...@gmail.com> wrote:
>>
>>> Hi Erik
>>>
>>> Thanks for your reply!  It's great to hear about real production usages.
>>> For our use case, we are really puzzled by the outcome so far. The initial
>>> investigation seems to indicate that workers don't die by themselves ( i
>>> actually tried killing the supervisor and the worker would continue running
>>> beyond 30 minutes).
>>>
>>> The sequence of events is like this:  supervisor immediately complains
>>> worker "still has not started" for a few seconds right after launching the
>>> worker process, then silent --> after 26 minutes, nimbus complains
>>> executors (related to the worker) "not alive" and started to reassign
>>> topology --> after another ~500 milliseconds, the supervisor shuts down its
>>> worker --> other peer workers complain about netty issues. and the loop
>>> goes on.
>>>
>>> Could you kindly tell me what version of zookeeper is used with 0.9.4?
>>> and how many nodes in the zookeeper cluster?
>>>
>>> I wonder if this is due to zookeeper issues.
>>>
>>> Thanks a lot,
>>> Fang
>>>
>>>
>>>
>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <ew...@groupon.com>
>>> wrote:
>>>
>>>> Hey Fang,
>>>>
>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>>>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>>>
>>>> One of the challenges with storm is figuring out what the root cause is
>>>> when things go haywire.  You'll wanna examine why the nimbus decided to
>>>> restart your worker processes.  It would happen when workers die and the
>>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>> looking at logs on the worker hosts.
>>>>
>>>> - Erik
>>>>
>>>>
>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>
>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
>>>>> 0.9.5 yet but I don't see any significant differences there), and
>>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>>> different disks.
>>>>>
>>>>> I have huge troubles to give my data analytics topology a stable run.
>>>>> So I tried the simplest topology I can think of, just an emtpy bolt, no io
>>>>> except for reading from kafka queue.
>>>>>
>>>>> Just to report my latest testing on 0.9.4 with this empty bolt (kakfa
>>>>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>>>>> size=1k).
>>>>> After 26 minutes, nimbus orders to kill the topology as it believe the
>>>>> topology is dead, then after another 2 minutes, another kill, then another
>>>>> after another 4 minutes, and on and on.
>>>>>
>>>>> I can understand there might be issues in the coordination among
>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>>> workarounds? I wish there are as so many of you are using it in production
>>>>> :-)
>>>>>
>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>> topology working!
>>>>>
>>>>> Fang
>>>>>
>>>>>
>>>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Nathan Leung <nc...@gmail.com>.
After I wrote that I realized you tried an empty topology anyway.  This
should reduce any GC- or worker-initialization-related failures, though they
are still possible.  As Erik mentioned, check ZK.  Also, I'm not sure if this
is still required, but it used to be helpful to make sure your storm nodes
have each other listed in /etc/hosts.
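
Since the workers resolve peer hostnames from inside the JVM, a quick sanity
check of name resolution on each node could look like the sketch below (the
hostname is a placeholder; run it from every node against every peer):

import java.net.InetAddress;

public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        // "storm-node-2" is a placeholder hostname.
        InetAddress addr = InetAddress.getByName("storm-node-2");
        System.out.println(addr.getHostName() + " -> " + addr.getHostAddress());
    }
}
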
On Jun 12, 2015 8:59 AM, "Nathan Leung" <nc...@gmail.com> wrote:

> Make sure your topology is starting up in the allotted time, and if not
> try increasing the startup timeout.
> On Jun 12, 2015 2:46 AM, "Fang Chen" <fc...@gmail.com> wrote:
>
>> Hi Erik
>>
>> Thanks for your reply!  It's great to hear about real production usages.
>> For our use case, we are really puzzled by the outcome so far. The initial
>> investigation seems to indicate that workers don't die by themselves ( i
>> actually tried killing the supervisor and the worker would continue running
>> beyond 30 minutes).
>>
>> The sequence of events is like this:  supervisor immediately complains
>> worker "still has not started" for a few seconds right after launching the
>> worker process, then silent --> after 26 minutes, nimbus complains
>> executors (related to the worker) "not alive" and started to reassign
>> topology --> after another ~500 milliseconds, the supervisor shuts down its
>> worker --> other peer workers complain about netty issues. and the loop
>> goes on.
>>
>> Could you kindly tell me what version of zookeeper is used with 0.9.4?
>> and how many nodes in the zookeeper cluster?
>>
>> I wonder if this is due to zookeeper issues.
>>
>> Thanks a lot,
>> Fang
>>
>>
>>
>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <ew...@groupon.com>
>> wrote:
>>
>>> Hey Fang,
>>>
>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>>
>>> One of the challenges with storm is figuring out what the root cause is
>>> when things go haywire.  You'll wanna examine why the nimbus decided to
>>> restart your worker processes.  It would happen when workers die and the
>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>> looking at logs on the worker hosts.
>>>
>>> - Erik
>>>
>>>
>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>
>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
>>>> 0.9.5 yet but I don't see any significant differences there), and
>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>> different disks.
>>>>
>>>> I have huge troubles to give my data analytics topology a stable run.
>>>> So I tried the simplest topology I can think of, just an emtpy bolt, no io
>>>> except for reading from kafka queue.
>>>>
>>>> Just to report my latest testing on 0.9.4 with this empty bolt (kakfa
>>>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>>>> size=1k).
>>>> After 26 minutes, nimbus orders to kill the topology as it believe the
>>>> topology is dead, then after another 2 minutes, another kill, then another
>>>> after another 4 minutes, and on and on.
>>>>
>>>> I can understand there might be issues in the coordination among
>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>> workarounds? I wish there are as so many of you are using it in production
>>>> :-)
>>>>
>>>> I deeply appreciate any suggestions that could even make my toy
>>>> topology working!
>>>>
>>>> Fang
>>>>
>>>>
>>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Nathan Leung <nc...@gmail.com>.
Make sure your topology is starting up in the allotted time, and if not, try
increasing the startup timeout.
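
For example, the launch-related timeouts are usually raised through keys like
these in storm.yaml (the values are illustrative, not a recommendation):

    nimbus.task.launch.secs: 240
    supervisor.worker.start.timeout.secs: 240
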
On Jun 12, 2015 2:46 AM, "Fang Chen" <fc...@gmail.com> wrote:

> Hi Erik
>
> Thanks for your reply!  It's great to hear about real production usages.
> For our use case, we are really puzzled by the outcome so far. The initial
> investigation seems to indicate that workers don't die by themselves ( i
> actually tried killing the supervisor and the worker would continue running
> beyond 30 minutes).
>
> The sequence of events is like this:  supervisor immediately complains
> worker "still has not started" for a few seconds right after launching the
> worker process, then silent --> after 26 minutes, nimbus complains
> executors (related to the worker) "not alive" and started to reassign
> topology --> after another ~500 milliseconds, the supervisor shuts down its
> worker --> other peer workers complain about netty issues. and the loop
> goes on.
>
> Could you kindly tell me what version of zookeeper is used with 0.9.4? and
> how many nodes in the zookeeper cluster?
>
> I wonder if this is due to zookeeper issues.
>
> Thanks a lot,
> Fang
>
>
>
> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <ew...@groupon.com>
> wrote:
>
>> Hey Fang,
>>
>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>
>> One of the challenges with storm is figuring out what the root cause is
>> when things go haywire.  You'll wanna examine why the nimbus decided to
>> restart your worker processes.  It would happen when workers die and the
>> nimbus notices that storm executors aren't alive.  (There are logs in
>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>> looking at logs on the worker hosts.
>>
>> - Erik
>>
>>
>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>
>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
>>> 0.9.5 yet but I don't see any significant differences there), and
>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>> different disks.
>>>
>>> I have huge troubles to give my data analytics topology a stable run. So
>>> I tried the simplest topology I can think of, just an emtpy bolt, no io
>>> except for reading from kafka queue.
>>>
>>> Just to report my latest testing on 0.9.4 with this empty bolt (kakfa
>>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>>> size=1k).
>>> After 26 minutes, nimbus orders to kill the topology as it believe the
>>> topology is dead, then after another 2 minutes, another kill, then another
>>> after another 4 minutes, and on and on.
>>>
>>> I can understand there might be issues in the coordination among nimbus,
>>> worker and executor (e.g., heartbeats). But are there any doable
>>> workarounds? I wish there are as so many of you are using it in production
>>> :-)
>>>
>>> I deeply appreciate any suggestions that could even make my toy topology
>>> working!
>>>
>>> Fang
>>>
>>>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Fang Chen <fc...@gmail.com>.
Thanks Derek!  I just tried forceSync=no, but it did not make any
difference.

Fang


On Sat, Jun 13, 2015 at 11:23 AM, Derek Dagit <de...@yahoo-inc.com> wrote:

> It sounds like a ZooKeeper issue to me.
>
> Some things we have seen before:
>
> 1. Too much write load on the disks from ZK in general.
> 2. ZK distribution comes with a tool that will clean-up/purge old files.
> When ZK purges its edit logs/data logs, it can remove several large files.
> On a file-system like EXT, these deletions create non-trivial disk load and
> can block everything else going on until the purge is done.
>
>
> For 1. We run production with one work-around that revans2 found that
> *significantly* helps:
>
> -Dzookeeper.forceSync=no
>
> Add this to the JVM arguments when launching each of the ZK servers in
> your cluster.  Normally, each write is written to disk in order to persist
> the changes.  But in Storm, we figured this data really does not need to be
> persisted so aggressively.  If enough ZK nodes go down at once, data can be
> lost.  In practice this is a risk that we think we can take.
>
> After using the work-around for 1., 2. did not matter so much, but we
> still have a tool that spaces out the deletes based on disk performance.
>
> --
> Derek
>
>   ------------------------------
>  *From:* Erik Weathers <ew...@groupon.com>
> *To:* "user@storm.apache.org" <us...@storm.apache.org>
> *Sent:* Saturday, June 13, 2015 1:52 AM
> *Subject:* Re: Has anybody successfully run storm 0.9+ in production
> under reasonable load?
>
> There is something fundamentally wrong.  You need to get to the root cause
> of what the worker process is doing that is preventing the heartbeats from
> arriving.
>
> - Erik
>
> On Friday, June 12, 2015, Fang Chen <fc...@gmail.com> wrote:
>
>
> I tuned up all worker timeout and task time out to 600 seconds, and seems
> like nimbus is happy about it after running the topology for 40minutes. But
> still one supervisor complained timeout from worker and then shut it down:
>
> 2015-06-12T23:59:20.633+0000 b.s.d.supervisor [INFO] Shutting down and
> clearing state for id 0cbfe8e5-b41f-451b-9005-107cef9b9e28. Current
> supervisor time: 1434153560. State: :timed-out, Heartbeat:
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1434152959,
> :storm-id "asyncVarGenTopology-1-1434151135", :executors #{[6 6] [11 11]
> [16 16] [21 21] [26 26] [-1 -1] [1 1]}, :port 6703}
>
>
> It's so hard to believe that even 600 seconds is not enough.
>
> On Fri, Jun 12, 2015 at 3:27 PM, Fang Chen <fc...@gmail.com> wrote:
>
> I turned on debug and seems like the nimbus reassign was indeed caused by
> heartbeat timeouts after running the topology for about 20 minutes. You can
> see that those non-live executors have a ":is-timed-out true"  status and
> executor reported time is about 100 second behind nimbus time, while other
> live executors have executor time head of nimbus time.
>
>
> ==========
>
> Heartbeat cache: {[2 2] {:is-timed-out false, :nimbus-time 1434146475,
> :executor-reported-time 1434146480}, [3 3] {:is-timed-out false,
> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [4 4]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146479}, [5 5] {:is-timed-out false, :nimbus-time 1434146475,
> :executor-reported-time 1434146480}, [6 6] {:is-timed-out true,
> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [7 7]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146480}, [8 8] {:is-timed-out false, :nimbus-time 1434146475,
> :executor-reported-time 1434146480}, [9 9] {:is-timed-out false,
> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [10 10]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146480}, [11 11] {:is-timed-out true, :nimbus-time 1434146355,
> :executor-reported-time 1434146250}, [12 12] {:is-timed-out false,
> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [13 13]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146480}, [14 14] {:is-timed-out false, :nimbus-time 1434146475,
> :executor-reported-time 1434146479}, [15 15] {:is-timed-out false,
> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [16 16]
> {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time
> 1434146250}, [17 17] {:is-timed-out false, :nimbus-time 1434146475,
> :executor-reported-time 1434146480}, [18 18] {:is-timed-out false,
> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [19 19]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146479}, [20 20] {:is-timed-out false, :nimbus-time 1434146475,
> :executor-reported-time 1434146480}, [21 21] {:is-timed-out true,
> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [22 22]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146480}, [23 23] {:is-timed-out false, :nimbus-time 1434146475,
> :executor-reported-time 1434146480}, [24 24] {:is-timed-out false,
> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [25 25]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146480}, [26 26] {:is-timed-out true, :nimbus-time 1434146355,
> :executor-reported-time 1434146250}, [27 27] {:is-timed-out false,
> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [28 28]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146480}, [1 1] {:is-timed-out true, :nimbus-time 1434146355,
> :executor-reported-time 1434146250}}
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
> asyncVarGenTopology-1-1434145217:[6 6] not alive
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
> asyncVarGenTopology-1-1434145217:[11 11] not alive
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
> asyncVarGenTopology-1-1434145217:[16 16] not alive
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
> asyncVarGenTopology-1-1434145217:[21 21] not alive
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
> asyncVarGenTopology-1-1434145217:[26 26] not alive
> 2015-06-12T22:01:15.998+0000 b.s.d.nimbus [INFO] Executor
> asyncVarGenTopology-1-1434145217:[1 1] not alive
>
>
> On Fri, Jun 12, 2015 at 2:47 PM, Fang Chen <fc...@gmail.com> wrote:
>
> supervisor.heartbeat.frequency.secs 5
> supervisor.monitor.frequency.secs 3
>
> task.heartbeat.frequency.secs 3
> worker.heartbeat.frequency.secs 1
>
> some nimbus parameters:
>
> nimbus.monitor.freq.secs 120
> nimbus.reassign true
> nimbus.supervisor.timeout.secs 60
> nimbus.task.launch.secs 120
> nimbus.task.timeout.secs 30
>
> When worker dies, the log in one of supervisors shows shutting down worker
> with  state of disallowed (which I googled around and some people say it's
> due to nimbus reassign). Other logs only show shutting down worker without
> any further information.
>
>
> On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <ew...@groupon.com>
> wrote:
>
> I'll have to look later, I think we are using ZooKeeper v3.3.6 (something
> like that).  Some clusters have 3 ZK hosts, some 5.
>
> The way the nimbus detects that the executors are not alive is by not
> seeing heartbeats updated in ZK.  There has to be some cause for the
> heartbeats not being updated.  Most likely one is that the worker
> process is dead.  Another one could be that the process is too busy Garbage
> Collecting, and so missed the timeout for updating the heartbeat.
>
> Regarding Supervisor and Worker: I think it's normal for the worker to be
> able to live absent the presence of the supervisor, so that sounds like
> expected behavior.
>
> What are your timeouts for the various heartbeats?
>
> Also, when the worker dies you should see a log from the supervisor
> noticing it.
>
> - Erik
>
> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>
> Hi Erik
>
> Thanks for your reply!  It's great to hear about real production usages.
> For our use case, we are really puzzled by the outcome so far. The initial
> investigation seems to indicate that workers don't die by themselves ( i
> actually tried killing the supervisor and the worker would continue running
> beyond 30 minutes).
>
> The sequence of events is like this:  supervisor immediately complains
> worker "still has not started" for a few seconds right after launching the
> worker process, then silent --> after 26 minutes, nimbus complains
> executors (related to the worker) "not alive" and started to reassign
> topology --> after another ~500 milliseconds, the supervisor shuts down its
> worker --> other peer workers complain about netty issues. and the loop
> goes on.
>
> Could you kindly tell me what version of zookeeper is used with 0.9.4? and
> how many nodes in the zookeeper cluster?
>
> I wonder if this is due to zookeeper issues.
>
> Thanks a lot,
> Fang
>
>
>
> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <ew...@groupon.com>
> wrote:
>
> Hey Fang,
>
> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>
> One of the challenges with storm is figuring out what the root cause is
> when things go haywire.  You'll wanna examine why the nimbus decided to
> restart your worker processes.  It would happen when workers die and the
> nimbus notices that storm executors aren't alive.  (There are logs in
> nimbus for this.)  Then you'll wanna dig into why the workers died by
> looking at logs on the worker hosts.
>
> - Erik
>
>
> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>
> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
> 0.9.5 yet but I don't see any significant differences there), and
> unfortunately we could not even have a clean run for over 30 minutes on a
> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
> different disks.
>
> I have huge troubles to give my data analytics topology a stable run. So I
> tried the simplest topology I can think of, just an emtpy bolt, no io
> except for reading from kafka queue.
>
> Just to report my latest testing on 0.9.4 with this empty bolt (kakfa
> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
> size=1k).
> After 26 minutes, nimbus orders to kill the topology as it believe the
> topology is dead, then after another 2 minutes, another kill, then another
> after another 4 minutes, and on and on.
>
> I can understand there might be issues in the coordination among nimbus,
> worker and executor (e.g., heartbeats). But are there any doable
> workarounds? I wish there are as so many of you are using it in production
> :-)
>
> I deeply appreciate any suggestions that could even make my toy topology
> working!
>
> Fang
>
>
>
>
>
>
>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Derek Dagit <de...@yahoo-inc.com>.
It sounds like a ZooKeeper issue to me.
Some things we have seen before:
1. Too much write load on the disks from ZK in general.
2. The ZK distribution comes with a tool that will clean up/purge old files.  When ZK purges its edit logs/data logs, it can remove several large files.  On a file system like EXT, these deletions create non-trivial disk load and can block everything else going on until the purge is done.


For 1, we run production with a work-around that revans2 found, which *significantly* helps:
-Dzookeeper.forceSync=no
Add this to the JVM arguments when launching each of the ZK servers in your cluster.  Normally, each write is written to disk in order to persist the changes.  But in Storm, we figured this data really does not need to be persisted so aggressively.  If enough ZK nodes go down at once, data can be lost.  In practice this is a risk that we think we can take.
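
One way to pass that flag, as a sketch only (it assumes the stock zkServer.sh,
which sources conf/java.env and adds $JVMFLAGS to the server's java command
line):

    # conf/java.env on each ZooKeeper server (illustrative)
    export JVMFLAGS="-Dzookeeper.forceSync=no $JVMFLAGS"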

After using the work-around for 1., 2. did not matter so much, but we still have a tool that spaces out the deletes based on disk performance.

--
Derek

      From: Erik Weathers <ew...@groupon.com>
 To: "user@storm.apache.org" <us...@storm.apache.org> 
 Sent: Saturday, June 13, 2015 1:52 AM
 Subject: Re: Has anybody successfully run storm 0.9+ in production under reasonable load?
   
There is something fundamentally wrong.  You need to get to the root cause of what the worker process is doing that is preventing the heartbeats from arriving.
- Erik

On Friday, June 12, 2015, Fang Chen <fc...@gmail.com> wrote:



I tuned up all worker timeout and task time out to 600 seconds, and seems like nimbus is happy about it after running the topology for 40minutes. But still one supervisor complained timeout from worker and then shut it down:
2015-06-12T23:59:20.633+0000 b.s.d.supervisor [INFO] Shutting down and clearing state for id 0cbfe8e5-b41f-451b-9005-107cef9b9e28. Current supervisor time: 1434153560. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1434152959, :storm-id "asyncVarGenTopology-1-1434151135", :executors #{[6 6] [11 11] [16 16] [21 21] [26 26] [-1 -1] [1 1]}, :port 6703}


It's so hard to believe that even 600 seconds is not enough.
On Fri, Jun 12, 2015 at 3:27 PM, Fang Chen <fc...@gmail.com> wrote:

I turned on debug and seems like the nimbus reassign was indeed caused by heartbeat timeouts after running the topology for about 20 minutes. You can see that those non-live executors have a ":is-timed-out true"  status and executor reported time is about 100 second behind nimbus time, while other live executors have executor time head of nimbus time.

==========

Heartbeat cache: {[2 2] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [3 3] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [4 4] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479}, [5 5] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [6 6] {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time 1434146250}, [7 7] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [8 8] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [9 9] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479}, [10 10] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [11 11] {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time 1434146250}, [12 12] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [13 13] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [14 14] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479}, [15 15] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [16 16] {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time 1434146250}, [17 17] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [18 18] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [19 19] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479}, [20 20] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [21 21] {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time 1434146250}, [22 22] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [23 23] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [24 24] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479}, [25 25] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [26 26] {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time 1434146250}, [27 27] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [28 28] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [1 1] {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time 1434146250}}
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[6 6] not alive
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[11 11] not alive
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[16 16] not alive
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[21 21] not alive
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[26 26] not alive
2015-06-12T22:01:15.998+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[1 1] not alive

On Fri, Jun 12, 2015 at 2:47 PM, Fang Chen <fc...@gmail.com> wrote:

supervisor.heartbeat.frequency.secs 5
supervisor.monitor.frequency.secs 3

task.heartbeat.frequency.secs 3
worker.heartbeat.frequency.secs 1

some nimbus parameters:
nimbus.monitor.freq.secs 120
nimbus.reassign true
nimbus.supervisor.timeout.secs 60
nimbus.task.launch.secs 120
nimbus.task.timeout.secs 30
When worker dies, the log in one of supervisors shows shutting down worker with  state of disallowed (which I googled around and some people say it's due to nimbus reassign). Other logs only show shutting down worker without any further information.

On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <ew...@groupon.com> wrote:

I'll have to look later, I think we are using ZooKeeper v3.3.6 (something like that).  Some clusters have 3 ZK hosts, some 5.
The way the nimbus detects that the executors are not alive is by not seeing heartbeats updated in ZK.  There has to be some cause for the heartbeats not being updated.  Most likely one is that the worker process is dead.  Another one could be that the process is too busy Garbage Collecting, and so missed the timeout for updating the heartbeat.
Regarding Supervisor and Worker: I think it's normal for the worker to be able to live absent the presence of the supervisor, so that sounds like expected behavior.
What are your timeouts for the various heartbeats?
Also, when the worker dies you should see a log from the supervisor noticing it.

- Erik

On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:

Hi Erik
Thanks for your reply!  It's great to hear about real production usages. For our use case, we are really puzzled by the outcome so far. The initial investigation seems to indicate that workers don't die by themselves ( i actually tried killing the supervisor and the worker would continue running beyond 30 minutes).  
The sequence of events is like this:  supervisor immediately complains worker "still has not started" for a few seconds right after launching the worker process, then silent --> after 26 minutes, nimbus complains executors (related to the worker) "not alive" and started to reassign topology --> after another ~500 milliseconds, the supervisor shuts down its worker --> other peer workers complain about netty issues. and the loop goes on.
Could you kindly tell me what version of zookeeper is used with 0.9.4? and how many nodes in the zookeeper cluster? 
I wonder if this is due to zookeeper issues.
Thanks a lot,
Fang
 
On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <ew...@groupon.com> wrote:

Hey Fang,
Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
One of the challenges with storm is figuring out what the root cause is when things go haywire.  You'll wanna examine why the nimbus decided to restart your worker processes.  It would happen when workers die and the nimbus notices that storm executors aren't alive.  (There are logs in nimbus for this.)  Then you'll wanna dig into why the workers died by looking at logs on the worker hosts.
- Erik

On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:

We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried 0.9.5 yet but I don't see any significant differences there), and unfortunately we could not even have a clean run for over 30 minutes on a cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on different disks.
I have huge troubles to give my data analytics topology a stable run. So I tried the simplest topology I can think of, just an emtpy bolt, no io except for reading from kafka queue. 
Just to report my latest testing on 0.9.4 with this empty bolt (kakfa topic partition=1, spout task #=1, bolt #=20 with field grouping, msg size=1k). After 26 minutes, nimbus orders to kill the topology as it believe the topology is dead, then after another 2 minutes, another kill, then another after another 4 minutes, and on and on. 

I can understand there might be issues in the coordination among nimbus, worker and executor (e.g., heartbeats). But are there any doable workarounds? I wish there are as so many of you are using it in production :-)
I deeply appreciate any suggestions that could even make my toy topology working!
Fang













  

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Fang Chen <fc...@gmail.com>.
I found a temporary workaround which has just made my toy topology last for
over 90 minutes now. I manually restarted all supervisors when I found that
a worker had gone into the hung state, and it seems like every component is
happy now. I did this just once so I don't know if I need to do it again :-(

I originally thought it might be due to supervisor JVM memory issues, but I
tuned all the memory parameters and enabled GC logging, and found nothing
suspicious.  I will test with my real topology later.
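
For reference, one common way to get GC logging out of the worker JVMs is via
worker.childopts in storm.yaml; the heap size, flags, and log path below are
only an illustration, not the settings actually used here:

    worker.childopts: "-Xmx768m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/storm/worker-gc.log"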

Thanks,
Fang


On Sun, Jun 14, 2015 at 11:04 PM, Fang Chen <fc...@gmail.com> wrote:

> I also believe something is going on but just can't find it out.
>
> But I do observe in all my experiments, the first worker that started to
> lose heartbeats is the one with kakfa spout task (I have only one spout
> task). And when it happens, it seems like the whole worker process hangs,
> none of the bolt tasks running in the same worker could spit out logging
> messages.  I double-checked the spout creation process, there is nothing
> suspicious since I don't really do anything except for using the
> forceFromStart param.   This continues until the worker gets recycled by
> supervisor after 600 seconds timeout.  Then I see another worker (on
> another node) starts to lose heartbeats.
>
> If I kill all supervisors, I can no longer observe the hang issue. It
> seems as if the supervisor is also affecting the worker behavior (maybe by
> reading heartbeats for zk??)
>
> I am planning to re-arrange the cluster and see if that makes any
> difference.
>
>
>
>
> On Fri, Jun 12, 2015 at 11:52 PM, Erik Weathers <ew...@groupon.com>
> wrote:
>
>> There is something fundamentally wrong.  You need to get to the root
>> cause of what the worker process is doing that is preventing the heartbeats
>> from arriving.
>>
>> - Erik
>>
>>
>> On Friday, June 12, 2015, Fang Chen <fc...@gmail.com> wrote:
>>
>>> I tuned up all worker timeout and task time out to 600 seconds, and
>>> seems like nimbus is happy about it after running the topology for
>>> 40minutes. But still one supervisor complained timeout from worker and then
>>> shut it down:
>>>
>>> 2015-06-12T23:59:20.633+0000 b.s.d.supervisor [INFO] Shutting down and
>>> clearing state for id 0cbfe8e5-b41f-451b-9005-107cef9b9e28. Current
>>> supervisor time: 1434153560. State: :timed-out, Heartbeat:
>>> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1434152959,
>>> :storm-id "asyncVarGenTopology-1-1434151135", :executors #{[6 6] [11 11]
>>> [16 16] [21 21] [26 26] [-1 -1] [1 1]}, :port 6703}
>>>
>>>
>>> It's so hard to believe that even 600 seconds is not enough.
>>>
>>> On Fri, Jun 12, 2015 at 3:27 PM, Fang Chen <fc...@gmail.com> wrote:
>>>
>>>> I turned on debug and seems like the nimbus reassign was indeed caused
>>>> by heartbeat timeouts after running the topology for about 20 minutes. You
>>>> can see that those non-live executors have a ":is-timed-out true"  status
>>>> and executor reported time is about 100 second behind nimbus time, while
>>>> other live executors have executor time head of nimbus time.
>>>>
>>>>
>>>> ==========
>>>>
>>>> Heartbeat cache: {[2 2] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [3 3] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [4 4]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146479}, [5 5] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [6 6] {:is-timed-out true,
>>>> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [7 7]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [8 8] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [9 9] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [10 10]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [11 11] {:is-timed-out true, :nimbus-time 1434146355,
>>>> :executor-reported-time 1434146250}, [12 12] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [13 13]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [14 14] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146479}, [15 15] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [16 16]
>>>> {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time
>>>> 1434146250}, [17 17] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [18 18] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [19 19]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146479}, [20 20] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [21 21] {:is-timed-out true,
>>>> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [22 22]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [23 23] {:is-timed-out false, :nimbus-time 1434146475,
>>>> :executor-reported-time 1434146480}, [24 24] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [25 25]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [26 26] {:is-timed-out true, :nimbus-time 1434146355,
>>>> :executor-reported-time 1434146250}, [27 27] {:is-timed-out false,
>>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [28 28]
>>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>>> 1434146480}, [1 1] {:is-timed-out true, :nimbus-time 1434146355,
>>>> :executor-reported-time 1434146250}}
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[6 6] not alive
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[11 11] not alive
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[16 16] not alive
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[21 21] not alive
>>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[26 26] not alive
>>>> 2015-06-12T22:01:15.998+0000 b.s.d.nimbus [INFO] Executor
>>>> asyncVarGenTopology-1-1434145217:[1 1] not alive
>>>>
>>>>
>>>> On Fri, Jun 12, 2015 at 2:47 PM, Fang Chen <fc...@gmail.com> wrote:
>>>>
>>>>> supervisor.heartbeat.frequency.secs 5
>>>>> supervisor.monitor.frequency.secs 3
>>>>>
>>>>> task.heartbeat.frequency.secs 3
>>>>> worker.heartbeat.frequency.secs 1
>>>>>
>>>>> some nimbus parameters:
>>>>>
>>>>> nimbus.monitor.freq.secs 120
>>>>> nimbus.reassign true
>>>>> nimbus.supervisor.timeout.secs 60
>>>>> nimbus.task.launch.secs 120
>>>>> nimbus.task.timeout.secs 30
>>>>>
>>>>> When worker dies, the log in one of supervisors shows shutting down
>>>>> worker with  state of disallowed (which I googled around and some people
>>>>> say it's due to nimbus reassign). Other logs only show shutting down worker
>>>>> without any further information.
>>>>>
>>>>>
>>>>> On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <ew...@groupon.com>
>>>>> wrote:
>>>>>
>>>>>> I'll have to look later, I think we are using ZooKeeper v3.3.6
>>>>>> (something like that).  Some clusters have 3 ZK hosts, some 5.
>>>>>>
>>>>>> The way the nimbus detects that the executors are not alive is by not
>>>>>> seeing heartbeats updated in ZK.  There has to be some cause for the
>>>>>> heartbeats not being updated.  Most likely one is that the worker
>>>>>> process is dead.  Another one could be that the process is too busy Garbage
>>>>>> Collecting, and so missed the timeout for updating the heartbeat.
>>>>>>
>>>>>> Regarding Supervisor and Worker: I think it's normal for the worker
>>>>>> to be able to live absent the presence of the supervisor, so that sounds
>>>>>> like expected behavior.
>>>>>>
>>>>>> What are your timeouts for the various heartbeats?
>>>>>>
>>>>>> Also, when the worker dies you should see a log from the supervisor
>>>>>> noticing it.
>>>>>>
>>>>>> - Erik
>>>>>>
>>>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Erik
>>>>>>>
>>>>>>> Thanks for your reply!  It's great to hear about real production
>>>>>>> usages. For our use case, we are really puzzled by the outcome so far. The
>>>>>>> initial investigation seems to indicate that workers don't die by
>>>>>>> themselves ( i actually tried killing the supervisor and the worker would
>>>>>>> continue running beyond 30 minutes).
>>>>>>>
>>>>>>> The sequence of events is like this:  supervisor immediately
>>>>>>> complains worker "still has not started" for a few seconds right after
>>>>>>> launching the worker process, then silent --> after 26 minutes, nimbus
>>>>>>> complains executors (related to the worker) "not alive" and started to
>>>>>>> reassign topology --> after another ~500 milliseconds, the supervisor shuts
>>>>>>> down its worker --> other peer workers complain about netty issues. and the
>>>>>>> loop goes on.
>>>>>>>
>>>>>>> Could you kindly tell me what version of zookeeper is used with
>>>>>>> 0.9.4? and how many nodes in the zookeeper cluster?
>>>>>>>
>>>>>>> I wonder if this is due to zookeeper issues.
>>>>>>>
>>>>>>> Thanks a lot,
>>>>>>> Fang
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <
>>>>>>> eweathers@groupon.com> wrote:
>>>>>>>
>>>>>>>> Hey Fang,
>>>>>>>>
>>>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of 30+
>>>>>>>> nodes.
>>>>>>>>
>>>>>>>> One of the challenges with storm is figuring out what the root
>>>>>>>> cause is when things go haywire.  You'll wanna examine why the nimbus
>>>>>>>> decided to restart your worker processes.  It would happen when workers die
>>>>>>>> and the nimbus notices that storm executors aren't alive.  (There are logs
>>>>>>>> in nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>>>>>> looking at logs on the worker hosts.
>>>>>>>>
>>>>>>>> - Erik
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not
>>>>>>>>> tried 0.9.5 yet but I don't see any significant differences there), and
>>>>>>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>>>>>>> different disks.
>>>>>>>>>
>>>>>>>>> I have huge troubles to give my data analytics topology a stable
>>>>>>>>> run. So I tried the simplest topology I can think of, just an emtpy bolt,
>>>>>>>>> no io except for reading from kafka queue.
>>>>>>>>>
>>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>>> (kakfa topic partition=1, spout task #=1, bolt #=20 with field grouping,
>>>>>>>>> msg size=1k).
>>>>>>>>> After 26 minutes, nimbus orders to kill the topology as it believe
>>>>>>>>> the topology is dead, then after another 2 minutes, another kill, then
>>>>>>>>> another after another 4 minutes, and on and on.
>>>>>>>>>
>>>>>>>>> I can understand there might be issues in the coordination among
>>>>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>>>>>>> workarounds? I wish there are as so many of you are using it in production
>>>>>>>>> :-)
>>>>>>>>>
>>>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>>>> topology working!
>>>>>>>>>
>>>>>>>>> Fang
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Fang Chen <fc...@gmail.com>.
I also believe something is going on but just can't figure out what.

But I do observe in all my experiments that the first worker to start losing
heartbeats is the one with the kafka spout task (I have only one spout task).
And when it happens, it seems like the whole worker process hangs: none of
the bolt tasks running in the same worker can emit any logging messages.  I
double-checked the spout creation process, and there is nothing suspicious
there since I don't really do anything except set the forceFromStart param
(see the sketch below).  This continues until the worker gets recycled by
the supervisor after the 600-second timeout.  Then I see another worker (on
another node) start to lose heartbeats.
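
For context, a minimal sketch of the kind of kafka-spout plus empty-bolt
wiring described here (a sketch only: class names, the ZK connect string,
topic name, and parallelism values are illustrative and this is not the
actual topology code; the "str" field is the one declared by storm-kafka's
StringScheme):

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class ToyKafkaTopology {

        // Bolt that does no work and emits nothing, so any stall cannot come from user logic.
        public static class NoOpBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                // intentionally empty
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // no output streams
            }
        }

        public static void main(String[] args) throws Exception {
            // Single-partition topic, reading from the beginning of the queue.
            SpoutConfig spoutConf = new SpoutConfig(
                    new ZkHosts("zk1:2181,zk2:2181,zk3:2181"),   // illustrative ZK connect string
                    "test-topic", "/kafka-spout", "toy-topology");
            spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());  // emits one "str" field per message
            spoutConf.forceFromStart = true;

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 1);   // one spout task
            builder.setBolt("empty-bolt", new NoOpBolt(), 20)                // 20 bolt executors
                   .fieldsGrouping("kafka-spout", new Fields("str"));        // field grouping as described

            Config conf = new Config();
            conf.setNumWorkers(5);  // illustrative: one worker per node on a 5-node cluster
            StormSubmitter.submitTopology("toyTopology", conf, builder.createTopology());
        }
    }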

If I kill all supervisors, I can no longer observe the hang issue. It seems
as if the supervisor is also affecting the worker behavior (maybe through
reading worker heartbeats from ZK?).

I am planning to re-arrange the cluster and see if that makes any
difference.




On Fri, Jun 12, 2015 at 11:52 PM, Erik Weathers <ew...@groupon.com>
wrote:

> There is something fundamentally wrong.  You need to get to the root cause
> of what the worker process is doing that is preventing the heartbeats from
> arriving.
>
> - Erik
>
>
> On Friday, June 12, 2015, Fang Chen <fc...@gmail.com> wrote:
>
>> I tuned up all worker timeout and task time out to 600 seconds, and seems
>> like nimbus is happy about it after running the topology for 40minutes. But
>> still one supervisor complained timeout from worker and then shut it down:
>>
>> 2015-06-12T23:59:20.633+0000 b.s.d.supervisor [INFO] Shutting down and
>> clearing state for id 0cbfe8e5-b41f-451b-9005-107cef9b9e28. Current
>> supervisor time: 1434153560. State: :timed-out, Heartbeat:
>> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1434152959,
>> :storm-id "asyncVarGenTopology-1-1434151135", :executors #{[6 6] [11 11]
>> [16 16] [21 21] [26 26] [-1 -1] [1 1]}, :port 6703}
>>
>>
>> It's so hard to believe that even 600 seconds is not enough.
>>
>> On Fri, Jun 12, 2015 at 3:27 PM, Fang Chen <fc...@gmail.com> wrote:
>>
>>> I turned on debug and seems like the nimbus reassign was indeed caused
>>> by heartbeat timeouts after running the topology for about 20 minutes. You
>>> can see that those non-live executors have a ":is-timed-out true"  status
>>> and executor reported time is about 100 second behind nimbus time, while
>>> other live executors have executor time head of nimbus time.
>>>
>>>
>>> ==========
>>>
>>> Heartbeat cache: {[2 2] {:is-timed-out false, :nimbus-time 1434146475,
>>> :executor-reported-time 1434146480}, [3 3] {:is-timed-out false,
>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [4 4]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146479}, [5 5] {:is-timed-out false, :nimbus-time 1434146475,
>>> :executor-reported-time 1434146480}, [6 6] {:is-timed-out true,
>>> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [7 7]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146480}, [8 8] {:is-timed-out false, :nimbus-time 1434146475,
>>> :executor-reported-time 1434146480}, [9 9] {:is-timed-out false,
>>> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [10 10]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146480}, [11 11] {:is-timed-out true, :nimbus-time 1434146355,
>>> :executor-reported-time 1434146250}, [12 12] {:is-timed-out false,
>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [13 13]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146480}, [14 14] {:is-timed-out false, :nimbus-time 1434146475,
>>> :executor-reported-time 1434146479}, [15 15] {:is-timed-out false,
>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [16 16]
>>> {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time
>>> 1434146250}, [17 17] {:is-timed-out false, :nimbus-time 1434146475,
>>> :executor-reported-time 1434146480}, [18 18] {:is-timed-out false,
>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [19 19]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146479}, [20 20] {:is-timed-out false, :nimbus-time 1434146475,
>>> :executor-reported-time 1434146480}, [21 21] {:is-timed-out true,
>>> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [22 22]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146480}, [23 23] {:is-timed-out false, :nimbus-time 1434146475,
>>> :executor-reported-time 1434146480}, [24 24] {:is-timed-out false,
>>> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [25 25]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146480}, [26 26] {:is-timed-out true, :nimbus-time 1434146355,
>>> :executor-reported-time 1434146250}, [27 27] {:is-timed-out false,
>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [28 28]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146480}, [1 1] {:is-timed-out true, :nimbus-time 1434146355,
>>> :executor-reported-time 1434146250}}
>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>> asyncVarGenTopology-1-1434145217:[6 6] not alive
>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>> asyncVarGenTopology-1-1434145217:[11 11] not alive
>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>> asyncVarGenTopology-1-1434145217:[16 16] not alive
>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>> asyncVarGenTopology-1-1434145217:[21 21] not alive
>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>> asyncVarGenTopology-1-1434145217:[26 26] not alive
>>> 2015-06-12T22:01:15.998+0000 b.s.d.nimbus [INFO] Executor
>>> asyncVarGenTopology-1-1434145217:[1 1] not alive
>>>
>>>
>>> On Fri, Jun 12, 2015 at 2:47 PM, Fang Chen <fc...@gmail.com> wrote:
>>>
>>>> supervisor.heartbeat.frequency.secs 5
>>>> supervisor.monitor.frequency.secs 3
>>>>
>>>> task.heartbeat.frequency.secs 3
>>>> worker.heartbeat.frequency.secs 1
>>>>
>>>> some nimbus parameters:
>>>>
>>>> nimbus.monitor.freq.secs 120
>>>> nimbus.reassign true
>>>> nimbus.supervisor.timeout.secs 60
>>>> nimbus.task.launch.secs 120
>>>> nimbus.task.timeout.secs 30
>>>>
>>>> When worker dies, the log in one of supervisors shows shutting down
>>>> worker with  state of disallowed (which I googled around and some people
>>>> say it's due to nimbus reassign). Other logs only show shutting down worker
>>>> without any further information.
>>>>
>>>>
>>>> On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <ew...@groupon.com>
>>>> wrote:
>>>>
>>>>> I'll have to look later, I think we are using ZooKeeper v3.3.6
>>>>> (something like that).  Some clusters have 3 ZK hosts, some 5.
>>>>>
>>>>> The way the nimbus detects that the executors are not alive is by not
>>>>> seeing heartbeats updated in ZK.  There has to be some cause for the
>>>>> heartbeats not being updated.  Most likely one is that the worker
>>>>> process is dead.  Another one could be that the process is too busy Garbage
>>>>> Collecting, and so missed the timeout for updating the heartbeat.
>>>>>
>>>>> Regarding Supervisor and Worker: I think it's normal for the worker to
>>>>> be able to live absent the presence of the supervisor, so that sounds like
>>>>> expected behavior.
>>>>>
>>>>> What are your timeouts for the various heartbeats?
>>>>>
>>>>> Also, when the worker dies you should see a log from the supervisor
>>>>> noticing it.
>>>>>
>>>>> - Erik
>>>>>
>>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>>
>>>>>> Hi Erik
>>>>>>
>>>>>> Thanks for your reply!  It's great to hear about real production
>>>>>> usages. For our use case, we are really puzzled by the outcome so far. The
>>>>>> initial investigation seems to indicate that workers don't die by
>>>>>> themselves ( i actually tried killing the supervisor and the worker would
>>>>>> continue running beyond 30 minutes).
>>>>>>
>>>>>> The sequence of events is like this:  supervisor immediately
>>>>>> complains worker "still has not started" for a few seconds right after
>>>>>> launching the worker process, then silent --> after 26 minutes, nimbus
>>>>>> complains executors (related to the worker) "not alive" and started to
>>>>>> reassign topology --> after another ~500 milliseconds, the supervisor shuts
>>>>>> down its worker --> other peer workers complain about netty issues. and the
>>>>>> loop goes on.
>>>>>>
>>>>>> Could you kindly tell me what version of zookeeper is used with
>>>>>> 0.9.4? and how many nodes in the zookeeper cluster?
>>>>>>
>>>>>> I wonder if this is due to zookeeper issues.
>>>>>>
>>>>>> Thanks a lot,
>>>>>> Fang
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <
>>>>>> eweathers@groupon.com> wrote:
>>>>>>
>>>>>>> Hey Fang,
>>>>>>>
>>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of 30+
>>>>>>> nodes.
>>>>>>>
>>>>>>> One of the challenges with storm is figuring out what the root cause
>>>>>>> is when things go haywire.  You'll wanna examine why the nimbus decided to
>>>>>>> restart your worker processes.  It would happen when workers die and the
>>>>>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>>>>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>>>>> looking at logs on the worker hosts.
>>>>>>>
>>>>>>> - Erik
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>>>>
>>>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not
>>>>>>>> tried 0.9.5 yet but I don't see any significant differences there), and
>>>>>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>>>>>> different disks.
>>>>>>>>
>>>>>>>> I have huge troubles to give my data analytics topology a stable
>>>>>>>> run. So I tried the simplest topology I can think of, just an emtpy bolt,
>>>>>>>> no io except for reading from kafka queue.
>>>>>>>>
>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>> (kakfa topic partition=1, spout task #=1, bolt #=20 with field grouping,
>>>>>>>> msg size=1k).
>>>>>>>> After 26 minutes, nimbus orders to kill the topology as it believe
>>>>>>>> the topology is dead, then after another 2 minutes, another kill, then
>>>>>>>> another after another 4 minutes, and on and on.
>>>>>>>>
>>>>>>>> I can understand there might be issues in the coordination among
>>>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>>>>>> workarounds? I wish there are as so many of you are using it in production
>>>>>>>> :-)
>>>>>>>>
>>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>>> topology working!
>>>>>>>>
>>>>>>>> Fang
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>
>>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Erik Weathers <ew...@groupon.com>.
There is something fundamentally wrong.  You need to get to the root cause
of what the worker process is doing that is preventing the heartbeats from
arriving.

- Erik

On Friday, June 12, 2015, Fang Chen <fc...@gmail.com> wrote:

> I tuned up all worker timeout and task time out to 600 seconds, and seems
> like nimbus is happy about it after running the topology for 40minutes. But
> still one supervisor complained timeout from worker and then shut it down:
>
> 2015-06-12T23:59:20.633+0000 b.s.d.supervisor [INFO] Shutting down and
> clearing state for id 0cbfe8e5-b41f-451b-9005-107cef9b9e28. Current
> supervisor time: 1434153560. State: :timed-out, Heartbeat:
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1434152959,
> :storm-id "asyncVarGenTopology-1-1434151135", :executors #{[6 6] [11 11]
> [16 16] [21 21] [26 26] [-1 -1] [1 1]}, :port 6703}
>
>
> It's so hard to believe that even 600 seconds is not enough.
>
> On Fri, Jun 12, 2015 at 3:27 PM, Fang Chen <fc2004@gmail.com> wrote:
>
>> I turned on debug and seems like the nimbus reassign was indeed caused by
>> heartbeat timeouts after running the topology for about 20 minutes. You can
>> see that those non-live executors have a ":is-timed-out true"  status and
>> executor reported time is about 100 second behind nimbus time, while other
>> live executors have executor time head of nimbus time.
>>
>>
>> ==========
>>
>> Heartbeat cache: {[2 2] {:is-timed-out false, :nimbus-time 1434146475,
>> :executor-reported-time 1434146480}, [3 3] {:is-timed-out false,
>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [4 4]
>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>> 1434146479}, [5 5] {:is-timed-out false, :nimbus-time 1434146475,
>> :executor-reported-time 1434146480}, [6 6] {:is-timed-out true,
>> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [7 7]
>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>> 1434146480}, [8 8] {:is-timed-out false, :nimbus-time 1434146475,
>> :executor-reported-time 1434146480}, [9 9] {:is-timed-out false,
>> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [10 10]
>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>> 1434146480}, [11 11] {:is-timed-out true, :nimbus-time 1434146355,
>> :executor-reported-time 1434146250}, [12 12] {:is-timed-out false,
>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [13 13]
>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>> 1434146480}, [14 14] {:is-timed-out false, :nimbus-time 1434146475,
>> :executor-reported-time 1434146479}, [15 15] {:is-timed-out false,
>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [16 16]
>> {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time
>> 1434146250}, [17 17] {:is-timed-out false, :nimbus-time 1434146475,
>> :executor-reported-time 1434146480}, [18 18] {:is-timed-out false,
>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [19 19]
>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>> 1434146479}, [20 20] {:is-timed-out false, :nimbus-time 1434146475,
>> :executor-reported-time 1434146480}, [21 21] {:is-timed-out true,
>> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [22 22]
>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>> 1434146480}, [23 23] {:is-timed-out false, :nimbus-time 1434146475,
>> :executor-reported-time 1434146480}, [24 24] {:is-timed-out false,
>> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [25 25]
>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>> 1434146480}, [26 26] {:is-timed-out true, :nimbus-time 1434146355,
>> :executor-reported-time 1434146250}, [27 27] {:is-timed-out false,
>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [28 28]
>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>> 1434146480}, [1 1] {:is-timed-out true, :nimbus-time 1434146355,
>> :executor-reported-time 1434146250}}
>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>> asyncVarGenTopology-1-1434145217:[6 6] not alive
>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>> asyncVarGenTopology-1-1434145217:[11 11] not alive
>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>> asyncVarGenTopology-1-1434145217:[16 16] not alive
>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>> asyncVarGenTopology-1-1434145217:[21 21] not alive
>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>> asyncVarGenTopology-1-1434145217:[26 26] not alive
>> 2015-06-12T22:01:15.998+0000 b.s.d.nimbus [INFO] Executor
>> asyncVarGenTopology-1-1434145217:[1 1] not alive
>>
>>
>> On Fri, Jun 12, 2015 at 2:47 PM, Fang Chen <fc2004@gmail.com> wrote:
>>
>>> supervisor.heartbeat.frequency.secs 5
>>> supervisor.monitor.frequency.secs 3
>>>
>>> task.heartbeat.frequency.secs 3
>>> worker.heartbeat.frequency.secs 1
>>>
>>> some nimbus parameters:
>>>
>>> nimbus.monitor.freq.secs 120
>>> nimbus.reassign true
>>> nimbus.supervisor.timeout.secs 60
>>> nimbus.task.launch.secs 120
>>> nimbus.task.timeout.secs 30
>>>
>>> When worker dies, the log in one of supervisors shows shutting down
>>> worker with  state of disallowed (which I googled around and some people
>>> say it's due to nimbus reassign). Other logs only show shutting down worker
>>> without any further information.
>>>
>>>
>>> On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <eweathers@groupon.com> wrote:
>>>
>>>> I'll have to look later, I think we are using ZooKeeper v3.3.6
>>>> (something like that).  Some clusters have 3 ZK hosts, some 5.
>>>>
>>>> The way the nimbus detects that the executors are not alive is by not
>>>> seeing heartbeats updated in ZK.  There has to be some cause for the
>>>> heartbeats not being updated.  Most likely one is that the worker
>>>> process is dead.  Another one could be that the process is too busy Garbage
>>>> Collecting, and so missed the timeout for updating the heartbeat.
>>>>
>>>> Regarding Supervisor and Worker: I think it's normal for the worker to
>>>> be able to live absent the presence of the supervisor, so that sounds like
>>>> expected behavior.
>>>>
>>>> What are your timeouts for the various heartbeats?
>>>>
>>>> Also, when the worker dies you should see a log from the supervisor
>>>> noticing it.
>>>>
>>>> - Erik
>>>>
>>>> On Thursday, June 11, 2015, Fang Chen <fc2004@gmail.com> wrote:
>>>>
>>>>> Hi Erik
>>>>>
>>>>> Thanks for your reply!  It's great to hear about real production
>>>>> usages. For our use case, we are really puzzled by the outcome so far. The
>>>>> initial investigation seems to indicate that workers don't die by
>>>>> themselves ( i actually tried killing the supervisor and the worker would
>>>>> continue running beyond 30 minutes).
>>>>>
>>>>> The sequence of events is like this:  supervisor immediately complains
>>>>> worker "still has not started" for a few seconds right after launching the
>>>>> worker process, then silent --> after 26 minutes, nimbus complains
>>>>> executors (related to the worker) "not alive" and started to reassign
>>>>> topology --> after another ~500 milliseconds, the supervisor shuts down its
>>>>> worker --> other peer workers complain about netty issues. and the loop
>>>>> goes on.
>>>>>
>>>>> Could you kindly tell me what version of zookeeper is used with 0.9.4?
>>>>> and how many nodes in the zookeeper cluster?
>>>>>
>>>>> I wonder if this is due to zookeeper issues.
>>>>>
>>>>> Thanks a lot,
>>>>> Fang
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <eweathers@groupon.com> wrote:
>>>>>
>>>>>> Hey Fang,
>>>>>>
>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of 30+
>>>>>> nodes.
>>>>>>
>>>>>> One of the challenges with storm is figuring out what the root cause
>>>>>> is when things go haywire.  You'll wanna examine why the nimbus decided to
>>>>>> restart your worker processes.  It would happen when workers die and the
>>>>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>>>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>>>> looking at logs on the worker hosts.
>>>>>>
>>>>>> - Erik
>>>>>>
>>>>>>
>>>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>>>
>>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not
>>>>>>> tried 0.9.5 yet but I don't see any significant differences there), and
>>>>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>>>>> different disks.
>>>>>>>
>>>>>>> I have huge troubles to give my data analytics topology a stable
>>>>>>> run. So I tried the simplest topology I can think of, just an emtpy bolt,
>>>>>>> no io except for reading from kafka queue.
>>>>>>>
>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>> (kakfa topic partition=1, spout task #=1, bolt #=20 with field grouping,
>>>>>>> msg size=1k).
>>>>>>> After 26 minutes, nimbus orders to kill the topology as it believe
>>>>>>> the topology is dead, then after another 2 minutes, another kill, then
>>>>>>> another after another 4 minutes, and on and on.
>>>>>>>
>>>>>>> I can understand there might be issues in the coordination among
>>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>>>>> workarounds? I wish there are as so many of you are using it in production
>>>>>>> :-)
>>>>>>>
>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>> topology working!
>>>>>>>
>>>>>>> Fang
>>>>>>>
>>>>>>>
>>>>>
>>>
>>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Fang Chen <fc...@gmail.com>.
I raised all of the worker and task timeouts to 600 seconds, and nimbus seems
happy with that after running the topology for 40 minutes. But one supervisor
still complained about a timed-out worker and then shut it down:

2015-06-12T23:59:20.633+0000 b.s.d.supervisor [INFO] Shutting down and
clearing state for id 0cbfe8e5-b41f-451b-9005-107cef9b9e28. Current
supervisor time: 1434153560. State: :timed-out, Heartbeat:
#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1434152959,
:storm-id "asyncVarGenTopology-1-1434151135", :executors #{[6 6] [11 11]
[16 16] [21 21] [26 26] [-1 -1] [1 1]}, :port 6703}


It's so hard to believe that even 600 seconds is not enough.
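
A quick arithmetic check on that log line (just back-of-the-envelope Java, not
output from any Storm tool; the two epoch values are copied from the log above):

public class HeartbeatLagCheck {
    public static void main(String[] args) {
        // Values copied from the supervisor log line above (epoch seconds).
        long supervisorTimeSecs  = 1434153560L;  // "Current supervisor time"
        long workerHeartbeatSecs = 1434152959L;  // WorkerHeartbeat :time-secs

        long lagSecs = supervisorTimeSecs - workerHeartbeatSecs;
        System.out.println("Worker heartbeat lag: " + lagSecs + " seconds");  // 601
    }
}

The gap is 601 seconds, i.e. the worker went just past the 600-second timeout
without recording a single heartbeat. That points at the worker being stalled
or dead for ten minutes, not at the timeout being too short.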

On Fri, Jun 12, 2015 at 3:27 PM, Fang Chen <fc...@gmail.com> wrote:

> I turned on debug and seems like the nimbus reassign was indeed caused by
> heartbeat timeouts after running the topology for about 20 minutes. You can
> see that those non-live executors have a ":is-timed-out true"  status and
> executor reported time is about 100 second behind nimbus time, while other
> live executors have executor time head of nimbus time.
>
>
> ==========
>
> Heartbeat cache: {[2 2] {:is-timed-out false, :nimbus-time 1434146475,
> :executor-reported-time 1434146480}, [3 3] {:is-timed-out false,
> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [4 4]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146479}, [5 5] {:is-timed-out false, :nimbus-time 1434146475,
> :executor-reported-time 1434146480}, [6 6] {:is-timed-out true,
> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [7 7]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146480}, [8 8] {:is-timed-out false, :nimbus-time 1434146475,
> :executor-reported-time 1434146480}, [9 9] {:is-timed-out false,
> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [10 10]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146480}, [11 11] {:is-timed-out true, :nimbus-time 1434146355,
> :executor-reported-time 1434146250}, [12 12] {:is-timed-out false,
> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [13 13]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146480}, [14 14] {:is-timed-out false, :nimbus-time 1434146475,
> :executor-reported-time 1434146479}, [15 15] {:is-timed-out false,
> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [16 16]
> {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time
> 1434146250}, [17 17] {:is-timed-out false, :nimbus-time 1434146475,
> :executor-reported-time 1434146480}, [18 18] {:is-timed-out false,
> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [19 19]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146479}, [20 20] {:is-timed-out false, :nimbus-time 1434146475,
> :executor-reported-time 1434146480}, [21 21] {:is-timed-out true,
> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [22 22]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146480}, [23 23] {:is-timed-out false, :nimbus-time 1434146475,
> :executor-reported-time 1434146480}, [24 24] {:is-timed-out false,
> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [25 25]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146480}, [26 26] {:is-timed-out true, :nimbus-time 1434146355,
> :executor-reported-time 1434146250}, [27 27] {:is-timed-out false,
> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [28 28]
> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
> 1434146480}, [1 1] {:is-timed-out true, :nimbus-time 1434146355,
> :executor-reported-time 1434146250}}
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
> asyncVarGenTopology-1-1434145217:[6 6] not alive
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
> asyncVarGenTopology-1-1434145217:[11 11] not alive
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
> asyncVarGenTopology-1-1434145217:[16 16] not alive
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
> asyncVarGenTopology-1-1434145217:[21 21] not alive
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
> asyncVarGenTopology-1-1434145217:[26 26] not alive
> 2015-06-12T22:01:15.998+0000 b.s.d.nimbus [INFO] Executor
> asyncVarGenTopology-1-1434145217:[1 1] not alive
>
>
> On Fri, Jun 12, 2015 at 2:47 PM, Fang Chen <fc...@gmail.com> wrote:
>
>> supervisor.heartbeat.frequency.secs 5
>> supervisor.monitor.frequency.secs 3
>>
>> task.heartbeat.frequency.secs 3
>> worker.heartbeat.frequency.secs 1
>>
>> some nimbus parameters:
>>
>> nimbus.monitor.freq.secs 120
>> nimbus.reassign true
>> nimbus.supervisor.timeout.secs 60
>> nimbus.task.launch.secs 120
>> nimbus.task.timeout.secs 30
>>
>> When worker dies, the log in one of supervisors shows shutting down
>> worker with  state of disallowed (which I googled around and some people
>> say it's due to nimbus reassign). Other logs only show shutting down worker
>> without any further information.
>>
>>
>> On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <ew...@groupon.com>
>> wrote:
>>
>>> I'll have to look later, I think we are using ZooKeeper v3.3.6
>>> (something like that).  Some clusters have 3 ZK hosts, some 5.
>>>
>>> The way the nimbus detects that the executors are not alive is by not
>>> seeing heartbeats updated in ZK.  There has to be some cause for the
>>> heartbeats not being updated.  Most likely one is that the worker
>>> process is dead.  Another one could be that the process is too busy Garbage
>>> Collecting, and so missed the timeout for updating the heartbeat.
>>>
>>> Regarding Supervisor and Worker: I think it's normal for the worker to
>>> be able to live absent the presence of the supervisor, so that sounds like
>>> expected behavior.
>>>
>>> What are your timeouts for the various heartbeats?
>>>
>>> Also, when the worker dies you should see a log from the supervisor
>>> noticing it.
>>>
>>> - Erik
>>>
>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>
>>>> Hi Erik
>>>>
>>>> Thanks for your reply!  It's great to hear about real production
>>>> usages. For our use case, we are really puzzled by the outcome so far. The
>>>> initial investigation seems to indicate that workers don't die by
>>>> themselves ( i actually tried killing the supervisor and the worker would
>>>> continue running beyond 30 minutes).
>>>>
>>>> The sequence of events is like this:  supervisor immediately complains
>>>> worker "still has not started" for a few seconds right after launching the
>>>> worker process, then silent --> after 26 minutes, nimbus complains
>>>> executors (related to the worker) "not alive" and started to reassign
>>>> topology --> after another ~500 milliseconds, the supervisor shuts down its
>>>> worker --> other peer workers complain about netty issues. and the loop
>>>> goes on.
>>>>
>>>> Could you kindly tell me what version of zookeeper is used with 0.9.4?
>>>> and how many nodes in the zookeeper cluster?
>>>>
>>>> I wonder if this is due to zookeeper issues.
>>>>
>>>> Thanks a lot,
>>>> Fang
>>>>
>>>>
>>>>
>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <ew...@groupon.com>
>>>> wrote:
>>>>
>>>>> Hey Fang,
>>>>>
>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>>>>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>>>>
>>>>> One of the challenges with storm is figuring out what the root cause
>>>>> is when things go haywire.  You'll wanna examine why the nimbus decided to
>>>>> restart your worker processes.  It would happen when workers die and the
>>>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>>> looking at logs on the worker hosts.
>>>>>
>>>>> - Erik
>>>>>
>>>>>
>>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>>
>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
>>>>>> 0.9.5 yet but I don't see any significant differences there), and
>>>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>>>> different disks.
>>>>>>
>>>>>> I have huge troubles to give my data analytics topology a stable run.
>>>>>> So I tried the simplest topology I can think of, just an emtpy bolt, no io
>>>>>> except for reading from kafka queue.
>>>>>>
>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt (kakfa
>>>>>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>>>>>> size=1k).
>>>>>> After 26 minutes, nimbus orders to kill the topology as it believe
>>>>>> the topology is dead, then after another 2 minutes, another kill, then
>>>>>> another after another 4 minutes, and on and on.
>>>>>>
>>>>>> I can understand there might be issues in the coordination among
>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>>>> workarounds? I wish there are as so many of you are using it in production
>>>>>> :-)
>>>>>>
>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>> topology working!
>>>>>>
>>>>>> Fang
>>>>>>
>>>>>>
>>>>
>>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Fang Chen <fc...@gmail.com>.
I turned on debug and it seems the nimbus reassignment was indeed caused by
heartbeat timeouts after running the topology for about 20 minutes. You can
see that the non-live executors have an ":is-timed-out true" status and their
executor-reported time is about 100 seconds behind nimbus time, while the
live executors report a time slightly ahead of nimbus time.


==========

Heartbeat cache: {
 [2 2]   {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [3 3]   {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [4 4]   {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479},
 [5 5]   {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [6 6]   {:is-timed-out true,  :nimbus-time 1434146355, :executor-reported-time 1434146250},
 [7 7]   {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [8 8]   {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [9 9]   {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479},
 [10 10] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [11 11] {:is-timed-out true,  :nimbus-time 1434146355, :executor-reported-time 1434146250},
 [12 12] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [13 13] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [14 14] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479},
 [15 15] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [16 16] {:is-timed-out true,  :nimbus-time 1434146355, :executor-reported-time 1434146250},
 [17 17] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [18 18] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [19 19] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479},
 [20 20] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [21 21] {:is-timed-out true,  :nimbus-time 1434146355, :executor-reported-time 1434146250},
 [22 22] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [23 23] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [24 24] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479},
 [25 25] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [26 26] {:is-timed-out true,  :nimbus-time 1434146355, :executor-reported-time 1434146250},
 [27 27] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [28 28] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480},
 [1 1]   {:is-timed-out true,  :nimbus-time 1434146355, :executor-reported-time 1434146250}}
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
asyncVarGenTopology-1-1434145217:[6 6] not alive
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
asyncVarGenTopology-1-1434145217:[11 11] not alive
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
asyncVarGenTopology-1-1434145217:[16 16] not alive
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
asyncVarGenTopology-1-1434145217:[21 21] not alive
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
asyncVarGenTopology-1-1434145217:[26 26] not alive
2015-06-12T22:01:15.998+0000 b.s.d.nimbus [INFO] Executor
asyncVarGenTopology-1-1434145217:[1 1] not alive
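
Note that the timed-out executors ([1 1], [6 6], [11 11], [16 16], [21 21],
[26 26]) are spaced exactly five apart; that looks like the executor set of a
single worker, the same set that shows up on one worker (port 6703) in the
supervisor heartbeat quoted elsewhere in this thread. So it appears one whole
worker stopped heartbeating rather than a few scattered executors. One way to
check whether those heartbeats are genuinely stale in ZooKeeper (as opposed to
nimbus misreading them) is to look at the modification times of the workerbeat
znodes directly. A minimal sketch with the plain ZooKeeper Java client,
assuming the default storm.zookeeper.root of "/storm"; the connect string is a
placeholder, and the topology id is the one from the nimbus log above:

import java.util.List;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class WorkerbeatCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; point it at your ZK ensemble.
        ZooKeeper zk = new ZooKeeper("zkhost1:2181", 30000, event -> { });
        // Worker heartbeats live under <zk-root>/workerbeats/<topology-id>,
        // one child znode per <supervisor-id>-<port>.
        String beatsRoot = "/storm/workerbeats/asyncVarGenTopology-1-1434145217";

        long nowMs = System.currentTimeMillis();
        List<String> workers = zk.getChildren(beatsRoot, false);
        for (String w : workers) {
            Stat stat = zk.exists(beatsRoot + "/" + w, false);
            long ageSecs = (nowMs - stat.getMtime()) / 1000;
            System.out.println(w + " heartbeat last written " + ageSecs + "s ago");
        }
        zk.close();
    }
}

If the suspect worker's znode really is minutes old, the problem is on the
worker side (hung, GC-ing, or dead); if the znodes are fresh but nimbus still
marks the executors as not alive, the problem is between nimbus and ZooKeeper.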


On Fri, Jun 12, 2015 at 2:47 PM, Fang Chen <fc...@gmail.com> wrote:

> supervisor.heartbeat.frequency.secs 5
> supervisor.monitor.frequency.secs 3
>
> task.heartbeat.frequency.secs 3
> worker.heartbeat.frequency.secs 1
>
> some nimbus parameters:
>
> nimbus.monitor.freq.secs 120
> nimbus.reassign true
> nimbus.supervisor.timeout.secs 60
> nimbus.task.launch.secs 120
> nimbus.task.timeout.secs 30
>
> When worker dies, the log in one of supervisors shows shutting down worker
> with  state of disallowed (which I googled around and some people say it's
> due to nimbus reassign). Other logs only show shutting down worker without
> any further information.
>
>
> On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <ew...@groupon.com>
> wrote:
>
>> I'll have to look later, I think we are using ZooKeeper v3.3.6 (something
>> like that).  Some clusters have 3 ZK hosts, some 5.
>>
>> The way the nimbus detects that the executors are not alive is by not
>> seeing heartbeats updated in ZK.  There has to be some cause for the
>> heartbeats not being updated.  Most likely one is that the worker
>> process is dead.  Another one could be that the process is too busy Garbage
>> Collecting, and so missed the timeout for updating the heartbeat.
>>
>> Regarding Supervisor and Worker: I think it's normal for the worker to be
>> able to live absent the presence of the supervisor, so that sounds like
>> expected behavior.
>>
>> What are your timeouts for the various heartbeats?
>>
>> Also, when the worker dies you should see a log from the supervisor
>> noticing it.
>>
>> - Erik
>>
>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>
>>> Hi Erik
>>>
>>> Thanks for your reply!  It's great to hear about real production usages.
>>> For our use case, we are really puzzled by the outcome so far. The initial
>>> investigation seems to indicate that workers don't die by themselves ( i
>>> actually tried killing the supervisor and the worker would continue running
>>> beyond 30 minutes).
>>>
>>> The sequence of events is like this:  supervisor immediately complains
>>> worker "still has not started" for a few seconds right after launching the
>>> worker process, then silent --> after 26 minutes, nimbus complains
>>> executors (related to the worker) "not alive" and started to reassign
>>> topology --> after another ~500 milliseconds, the supervisor shuts down its
>>> worker --> other peer workers complain about netty issues. and the loop
>>> goes on.
>>>
>>> Could you kindly tell me what version of zookeeper is used with 0.9.4?
>>> and how many nodes in the zookeeper cluster?
>>>
>>> I wonder if this is due to zookeeper issues.
>>>
>>> Thanks a lot,
>>> Fang
>>>
>>>
>>>
>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <ew...@groupon.com>
>>> wrote:
>>>
>>>> Hey Fang,
>>>>
>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>>>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>>>
>>>> One of the challenges with storm is figuring out what the root cause is
>>>> when things go haywire.  You'll wanna examine why the nimbus decided to
>>>> restart your worker processes.  It would happen when workers die and the
>>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>> looking at logs on the worker hosts.
>>>>
>>>> - Erik
>>>>
>>>>
>>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>>
>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
>>>>> 0.9.5 yet but I don't see any significant differences there), and
>>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>>> different disks.
>>>>>
>>>>> I have huge troubles to give my data analytics topology a stable run.
>>>>> So I tried the simplest topology I can think of, just an emtpy bolt, no io
>>>>> except for reading from kafka queue.
>>>>>
>>>>> Just to report my latest testing on 0.9.4 with this empty bolt (kakfa
>>>>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>>>>> size=1k).
>>>>> After 26 minutes, nimbus orders to kill the topology as it believe the
>>>>> topology is dead, then after another 2 minutes, another kill, then another
>>>>> after another 4 minutes, and on and on.
>>>>>
>>>>> I can understand there might be issues in the coordination among
>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>>> workarounds? I wish there are as so many of you are using it in production
>>>>> :-)
>>>>>
>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>> topology working!
>>>>>
>>>>> Fang
>>>>>
>>>>>
>>>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Fang Chen <fc...@gmail.com>.
supervisor.heartbeat.frequency.secs 5
supervisor.monitor.frequency.secs 3

task.heartbeat.frequency.secs 3
worker.heartbeat.frequency.secs 1

some nimbus parameters:

nimbus.monitor.freq.secs 120
nimbus.reassign true
nimbus.supervisor.timeout.secs 60
nimbus.task.launch.secs 120
nimbus.task.timeout.secs 30

When a worker dies, the log on one of the supervisors shows it shutting down
the worker with a state of :disallowed (which, from what I found by searching
around, some people say is due to a nimbus reassignment). The other supervisor
logs only show the worker being shut down, without any further information.
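
For what it's worth, the :disallowed state comes out of a small classification
step in the supervisor: on each monitor pass it reads the worker heartbeats it
knows about and buckets each worker before deciding what to shut down. Below is
a rough Java paraphrase of that logic, just to illustrate the distinction; the
real code is Clojure in backtype.storm.daemon.supervisor, and these names and
signatures are made up for the sketch:

// Illustrative paraphrase only, not actual Storm code.
enum WorkerState { DISALLOWED, NOT_STARTED, TIMED_OUT, VALID }

final class WorkerClassifier {
    static WorkerState classify(boolean matchesCurrentAssignment,
                                Long heartbeatTimeSecs,
                                long nowSecs,
                                int workerTimeoutSecs) {
        if (!matchesCurrentAssignment) {
            // Nimbus has reassigned the slot, so this worker is no longer wanted.
            return WorkerState.DISALLOWED;
        }
        if (heartbeatTimeSecs == null) {
            return WorkerState.NOT_STARTED;   // launched but never heartbeated
        }
        if (nowSecs - heartbeatTimeSecs > workerTimeoutSecs) {
            return WorkerState.TIMED_OUT;     // stale local heartbeat
        }
        return WorkerState.VALID;
    }
}

Anything other than VALID gets shut down. So a :disallowed kill is the
supervisor reacting to a nimbus reassignment (consistent with what you found),
while a :timed-out kill means the worker itself stopped writing its local
heartbeat.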


On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <ew...@groupon.com>
wrote:

> I'll have to look later, I think we are using ZooKeeper v3.3.6 (something
> like that).  Some clusters have 3 ZK hosts, some 5.
>
> The way the nimbus detects that the executors are not alive is by not
> seeing heartbeats updated in ZK.  There has to be some cause for the
> heartbeats not being updated.  Most likely one is that the worker
> process is dead.  Another one could be that the process is too busy Garbage
> Collecting, and so missed the timeout for updating the heartbeat.
>
> Regarding Supervisor and Worker: I think it's normal for the worker to be
> able to live absent the presence of the supervisor, so that sounds like
> expected behavior.
>
> What are your timeouts for the various heartbeats?
>
> Also, when the worker dies you should see a log from the supervisor
> noticing it.
>
> - Erik
>
> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>
>> Hi Erik
>>
>> Thanks for your reply!  It's great to hear about real production usages.
>> For our use case, we are really puzzled by the outcome so far. The initial
>> investigation seems to indicate that workers don't die by themselves ( i
>> actually tried killing the supervisor and the worker would continue running
>> beyond 30 minutes).
>>
>> The sequence of events is like this:  supervisor immediately complains
>> worker "still has not started" for a few seconds right after launching the
>> worker process, then silent --> after 26 minutes, nimbus complains
>> executors (related to the worker) "not alive" and started to reassign
>> topology --> after another ~500 milliseconds, the supervisor shuts down its
>> worker --> other peer workers complain about netty issues. and the loop
>> goes on.
>>
>> Could you kindly tell me what version of zookeeper is used with 0.9.4?
>> and how many nodes in the zookeeper cluster?
>>
>> I wonder if this is due to zookeeper issues.
>>
>> Thanks a lot,
>> Fang
>>
>>
>>
>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <ew...@groupon.com>
>> wrote:
>>
>>> Hey Fang,
>>>
>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>>
>>> One of the challenges with storm is figuring out what the root cause is
>>> when things go haywire.  You'll wanna examine why the nimbus decided to
>>> restart your worker processes.  It would happen when workers die and the
>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>> looking at logs on the worker hosts.
>>>
>>> - Erik
>>>
>>>
>>> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>>>
>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
>>>> 0.9.5 yet but I don't see any significant differences there), and
>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>>> different disks.
>>>>
>>>> I have huge troubles to give my data analytics topology a stable run.
>>>> So I tried the simplest topology I can think of, just an emtpy bolt, no io
>>>> except for reading from kafka queue.
>>>>
>>>> Just to report my latest testing on 0.9.4 with this empty bolt (kakfa
>>>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>>>> size=1k).
>>>> After 26 minutes, nimbus orders to kill the topology as it believe the
>>>> topology is dead, then after another 2 minutes, another kill, then another
>>>> after another 4 minutes, and on and on.
>>>>
>>>> I can understand there might be issues in the coordination among
>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable
>>>> workarounds? I wish there are as so many of you are using it in production
>>>> :-)
>>>>
>>>> I deeply appreciate any suggestions that could even make my toy
>>>> topology working!
>>>>
>>>> Fang
>>>>
>>>>
>>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Erik Weathers <ew...@groupon.com>.
I'll have to look later, I think we are using ZooKeeper v3.3.6 (something
like that).  Some clusters have 3 ZK hosts, some 5.

The way the nimbus detects that the executors are not alive is by not
seeing heartbeats updated in ZK.  There has to be some cause for the
heartbeats not being updated.  Most likely one is that the worker
process is dead.  Another one could be that the process is too busy Garbage
Collecting, and so missed the timeout for updating the heartbeat.
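
One practical way to test the GC theory is to turn on GC logging for the worker
JVMs and see whether a long stop-the-world pause lines up with the missed
heartbeats. A minimal sketch using the per-topology worker childopts (the flags
are standard HotSpot options; the log path is a placeholder, and this is only
one way to wire it in):

import backtype.storm.Config;

public class GcLoggingConf {
    // topology.worker.childopts is added on top of the supervisor-level
    // worker.childopts, so this does not replace existing JVM settings.
    public static Config withGcLogging() {
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
                 "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps "
                 + "-XX:+PrintGCApplicationStoppedTime "
                 // %ID% should expand to the worker port on 0.9.x, if memory
                 // serves; otherwise use a fixed path per supervisor host.
                 + "-Xloggc:/var/log/storm/gc-worker-%ID%.log");
        return conf;
    }
}

If the GC log shows multi-second (or multi-minute) pauses right before the
"not alive" decisions, the fix is heap tuning rather than timeout tuning.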

Regarding Supervisor and Worker: I think it's normal for the worker to be
able to live absent the presence of the supervisor, so that sounds like
expected behavior.

What are your timeouts for the various heartbeats?

Also, when the worker dies you should see a log from the supervisor
noticing it.

- Erik

On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:

> Hi Erik
>
> Thanks for your reply!  It's great to hear about real production usages.
> For our use case, we are really puzzled by the outcome so far. The initial
> investigation seems to indicate that workers don't die by themselves ( i
> actually tried killing the supervisor and the worker would continue running
> beyond 30 minutes).
>
> The sequence of events is like this:  supervisor immediately complains
> worker "still has not started" for a few seconds right after launching the
> worker process, then silent --> after 26 minutes, nimbus complains
> executors (related to the worker) "not alive" and started to reassign
> topology --> after another ~500 milliseconds, the supervisor shuts down its
> worker --> other peer workers complain about netty issues. and the loop
> goes on.
>
> Could you kindly tell me what version of zookeeper is used with 0.9.4? and
> how many nodes in the zookeeper cluster?
>
> I wonder if this is due to zookeeper issues.
>
> Thanks a lot,
> Fang
>
>
>
> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <eweathers@groupon.com> wrote:
>
>> Hey Fang,
>>
>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>
>> One of the challenges with storm is figuring out what the root cause is
>> when things go haywire.  You'll wanna examine why the nimbus decided to
>> restart your worker processes.  It would happen when workers die and the
>> nimbus notices that storm executors aren't alive.  (There are logs in
>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>> looking at logs on the worker hosts.
>>
>> - Erik
>>
>>
>> On Thursday, June 11, 2015, Fang Chen <fc2004@gmail.com> wrote:
>>
>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
>>> 0.9.5 yet but I don't see any significant differences there), and
>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>>> different disks.
>>>
>>> I have huge troubles to give my data analytics topology a stable run. So
>>> I tried the simplest topology I can think of, just an emtpy bolt, no io
>>> except for reading from kafka queue.
>>>
>>> Just to report my latest testing on 0.9.4 with this empty bolt (kakfa
>>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>>> size=1k).
>>> After 26 minutes, nimbus orders to kill the topology as it believe the
>>> topology is dead, then after another 2 minutes, another kill, then another
>>> after another 4 minutes, and on and on.
>>>
>>> I can understand there might be issues in the coordination among nimbus,
>>> worker and executor (e.g., heartbeats). But are there any doable
>>> workarounds? I wish there are as so many of you are using it in production
>>> :-)
>>>
>>> I deeply appreciate any suggestions that could even make my toy topology
>>> working!
>>>
>>> Fang
>>>
>>>
>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Fang Chen <fc...@gmail.com>.
Hi Erik

Thanks for your reply!  It's great to hear about real production usages.
For our use case, we are really puzzled by the outcome so far. The initial
investigation seems to indicate that workers don't die by themselves (I
actually tried killing the supervisor, and the worker continued running
beyond 30 minutes).

The sequence of events is like this: the supervisor immediately complains that
the worker "still has not started" for a few seconds right after launching the
worker process, then goes silent --> after 26 minutes, nimbus complains that
the executors (belonging to that worker) are "not alive" and starts to reassign
the topology --> after another ~500 milliseconds, the supervisor shuts down its
worker --> other peer workers complain about netty issues, and the loop
goes on.

Could you kindly tell me which version of ZooKeeper you use with 0.9.4, and
how many nodes are in the ZooKeeper cluster?

I wonder if this is due to zookeeper issues.

Thanks a lot,
Fang



On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <ew...@groupon.com>
wrote:

> Hey Fang,
>
> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>
> One of the challenges with storm is figuring out what the root cause is
> when things go haywire.  You'll wanna examine why the nimbus decided to
> restart your worker processes.  It would happen when workers die and the
> nimbus notices that storm executors aren't alive.  (There are logs in
> nimbus for this.)  Then you'll wanna dig into why the workers died by
> looking at logs on the worker hosts.
>
> - Erik
>
>
> On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:
>
>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
>> 0.9.5 yet but I don't see any significant differences there), and
>> unfortunately we could not even have a clean run for over 30 minutes on a
>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
>> different disks.
>>
>> I have huge troubles to give my data analytics topology a stable run. So
>> I tried the simplest topology I can think of, just an emtpy bolt, no io
>> except for reading from kafka queue.
>>
>> Just to report my latest testing on 0.9.4 with this empty bolt (kakfa
>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>> size=1k).
>> After 26 minutes, nimbus orders to kill the topology as it believe the
>> topology is dead, then after another 2 minutes, another kill, then another
>> after another 4 minutes, and on and on.
>>
>> I can understand there might be issues in the coordination among nimbus,
>> worker and executor (e.g., heartbeats). But are there any doable
>> workarounds? I wish there are as so many of you are using it in production
>> :-)
>>
>> I deeply appreciate any suggestions that could even make my toy topology
>> working!
>>
>> Fang
>>
>>

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

Posted by Erik Weathers <ew...@groupon.com>.
Hey Fang,

Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.

One of the challenges with storm is figuring out what the root cause is
when things go haywire.  You'll wanna examine why the nimbus decided to
restart your worker processes.  It would happen when workers die and the
nimbus notices that storm executors aren't alive.  (There are logs in
nimbus for this.)  Then you'll wanna dig into why the workers died by
looking at logs on the worker hosts.

- Erik

On Thursday, June 11, 2015, Fang Chen <fc...@gmail.com> wrote:

> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
> 0.9.5 yet but I don't see any significant differences there), and
> unfortunately we could not even have a clean run for over 30 minutes on a
> cluster of 5 high-end nodes. zookeeper is also set up on these nodes but on
> different disks.
>
> I have huge troubles to give my data analytics topology a stable run. So I
> tried the simplest topology I can think of, just an emtpy bolt, no io
> except for reading from kafka queue.
>
> Just to report my latest testing on 0.9.4 with this empty bolt (kakfa
> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
> size=1k).
> After 26 minutes, nimbus orders to kill the topology as it believe the
> topology is dead, then after another 2 minutes, another kill, then another
> after another 4 minutes, and on and on.
>
> I can understand there might be issues in the coordination among nimbus,
> worker and executor (e.g., heartbeats). But are there any doable
> workarounds? I wish there are as so many of you are using it in production
> :-)
>
> I deeply appreciate any suggestions that could even make my toy topology
> working!
>
> Fang
>
>