Posted to user@storm.apache.org by Danijel Schiavuzzi <da...@schiavuzzi.com> on 2014/07/03 15:36:38 UTC

Re: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Hi Bobby,

Just an update on the stuck Trident transactional topology issue -- I've
upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't
reproduce the bug anymore. Will keep you posted if any issues arise.

Regards,

Danijel


On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:

>  I have not seen this before, if you could file a JIRA on this that would
> be great.
>
>  - Bobby
>
>   From: Danijel Schiavuzzi <da...@schiavuzzi.com>
> Reply-To: "user@storm.incubator.apache.org" <
> user@storm.incubator.apache.org>
> Date: Wednesday, June 4, 2014 at 10:30 AM
> To: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>, "
> dev@storm.incubator.apache.org" <de...@storm.incubator.apache.org>
> Subject: Trident transactional topology stuck re-emitting batches with
> Netty, but running fine with ZMQ (was Re: Topology is stuck)
>
>   Hi all,
>
> I've managed to reproduce the stuck topology problem, and it seems it's due
> to the Netty transport. I'm running with the ZMQ transport enabled now and
> haven't been able to reproduce it.
>
>  The problem is basically a Trident/Kafka transactional topology getting
> stuck, i.e. re-emitting the same batches over and over again. This happens
> after the Storm workers restart a few times due to the Kafka spout throwing
> RuntimeExceptions (because of the Kafka consumer in the spout timing out
> with a SocketTimeoutException due to some temporary network problems).
> Sometimes the topology is stuck after just one worker is restarted, and
> sometimes a few worker restarts are needed to trigger the problem.
>
> I simulated the Kafka spout socket timeouts by blocking network access
> from Storm to my Kafka machines (with an iptables firewall rule). Most of
> the time the spouts (workers) would restart normally (after re-enabling
> access to Kafka) and the topology would continue to process batches, but
> sometimes the topology would get stuck re-emitting batches after the
> crashed workers restarted. Killing and re-submitting the topology manually
> always fixes this, and processing continues normally.
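>
> For reference, the blocking rule was something along these lines (the host
> is a placeholder and 9092 is just the default Kafka broker port, so adjust
> to your setup):
>
>     # drop outgoing traffic from the Storm nodes to the Kafka broker
>     iptables -I OUTPUT -p tcp -d <kafka-host> --dport 9092 -j DROP
>     # remove the rule again to let the spouts reconnect
>     iptables -D OUTPUT -p tcp -d <kafka-host> --dport 9092 -j DROP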
>
>  I haven't been able to reproduce this scenario after reverting my Storm
> cluster's transport to ZeroMQ. With Netty transport, I can almost always
> reproduce the problem by causing a worker to restart a number of times
> (only about 4-5 worker restarts are enough to trigger this).
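>
> For completeness, switching between the two transports is just a matter of
> this storm.yaml setting (class names as in the stock 0.9.x distributions,
> to the best of my knowledge):
>
>     # Netty transport, which triggers the problem for me:
>     storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>     # ZeroMQ transport, which works fine:
>     storm.messaging.transport: "backtype.storm.messaging.zmq"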
>
>  Any hints on this? Anyone had the same problem? It does seem a serious
> issue, as it affects the reliability and fault tolerance of the Storm cluster.
>
>  In the meantime, I'll try to prepare a reproducible test case for this.
>
>  Thanks,
>
> Danijel
>
>
> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <
> danijel@schiavuzzi.com> wrote:
>
>> To (partially) answer my own question -- I still have no idea on the
>> cause of the stuck topology, but re-submitting the topology helps -- after
>> re-submitting my topology is now running normally.
>>
>>
>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <
>> danijel@schiavuzzi.com> wrote:
>>
>>>  Also, I did have multiple cases of my IBackingMap workers dying
>>> (because of RuntimeExceptions) but successfully restarting afterwards (I
>>> throw RuntimeExceptions in the BackingMap implementation as my strategy in
>>> rare SQL database deadlock situations to force a worker restart and to
>>> fail+retry the batch).
>>>
>>>  From the logs, one such IBackingMap worker death (and subsequent
>>> restart) resulted in the Kafka spout re-emitting the pending tuple:
>>>
>>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting
>>> batch, attempt 29698959:736
>>>
>>>  This is of course the normal behavior of a transactional topology, but
>>> this is the first time I've encountered a case of a batch retrying
>>> indefinitely. This is especially suspicious since the topology has been
>>> running fine for 20 days straight, re-emitting batches and restarting
>>> IBackingMap workers quite a number of times.
>>>
>>> I can see in my IBackingMap backing SQL database that the batch with the
>>> exact txid value 29698959 has been committed -- but I suspect that could
>>> come from another BackingMap, since there are two BackingMap instances
>>> running (parallelismHint 2).
>>>
>>>  However, I have no idea why the batch is being retried indefinitely
>>> now nor why it hasn't been successfully acked by Trident.
>>>
>>> Any suggestions on the area (topology component) to focus my research on?
>>>
>>>  Thanks,
>>>
>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <
>>> danijel@schiavuzzi.com> wrote:
>>>
>>>>   Hello,
>>>>
>>>> I'm having problems with my transactional Trident topology. It has been
>>>> running fine for about 20 days, and is suddenly stuck processing a single
>>>> batch, with no tuples being emitted nor persisted by the
>>>> TridentState (IBackingMap).
>>>>
>>>> It's a simple topology which consumes messages off a Kafka queue. The
>>>> spout is an instance of storm-kafka-0.8-plus TransactionalTridentKafkaSpout
>>>> and I use the trident-mssql transactional TridentState implementation to
>>>> persistentAggregate() data into a SQL database.
>>>>
>>>>  In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>
>>>>      "/transactional/<myTopologyName>/coordinator/currattempts" is
>>>> "{"29698959":6487}"
>>>>
>>>> ... and the attempt count keeps increasing. It seems the batch with
>>>> txid 29698959 is stuck -- it looks like the batch isn't being acked by
>>>> Trident, and I have no idea why, especially since the topology has been
>>>> running successfully for the last 20 days.
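>>>>
>>>> One way to watch that counter is with the plain ZooKeeper CLI, roughly
>>>> like this (host, port and topology name are placeholders):
>>>>
>>>>     # dump the coordinator state for the topology; re-run it a few times
>>>>     bin/zkCli.sh -server <zk-host>:<zk-port> \
>>>>         get /transactional/<myTopologyName>/coordinator/currattempts
>>>>     # a healthy topology shows the txid advancing, a stuck one shows the
>>>>     # same txid with an ever-growing attempt count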
>>>>
>>>>  I did rebalance the topology on one occasion, after which it
>>>> continued running normally. Other than that, no other modifications were
>>>> done. Storm is at version 0.9.0.1.
>>>>
>>>>  Any hints on how to debug the stuck topology? Any other useful info I
>>>> might provide?
>>>>
>>>>  Thanks,
>>>>
>>>> --
>>>> Danijel Schiavuzzi
>>>>
>>>> E: danijel@schiavuzzi.com
>>>> W: www.schiavuzzi.com
>>>> T: +385989035562
>>>> Skype: danijel.schiavuzzi
>>>>
>>>
>>>
>>>
>>> --
>>> Danijel Schiavuzzi
>>>
>>> E: danijel@schiavuzzi.com
>>> W: www.schiavuzzi.com
>>> T: +385989035562
>>> Skype: danijel.schiavuzzi
>>>
>>
>>
>>
>> --
>> Danijel Schiavuzzi
>>
>> E: danijel@schiavuzzi.com
>> W: www.schiavuzzi.com
>> T: +385989035562
>>  Skype: danijels7
>>
>
>
>
> --
> Danijel Schiavuzzi
>
> E: danijel@schiavuzzi.com
> W: www.schiavuzzi.com
> T: +385989035562
> Skype: danijels7
>



-- 
Danijel Schiavuzzi

E: danijel@schiavuzzi.com
W: www.schiavuzzi.com
T: +385989035562
Skype: danijels7

Re: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Posted by "M.Tarkeshwar Rao" <ta...@gmail.com>.
In which version is it available?
On 16 Sep 2014 19:01, "Danijel Schiavuzzi" <da...@schiavuzzi.com> wrote:

> Yes, it's been fixed in 'master' for some time now.
>
> Danijel
>
> On Tuesday, September 16, 2014, M.Tarkeshwar Rao <ta...@gmail.com>
> wrote:
>
>> Hi Danijel,
>>
>> Is the issue resolved in any version of Storm?
>>
>> Regards
>> Tarkeshwar
>>
>> On Thu, Jul 17, 2014 at 6:57 PM, Danijel Schiavuzzi <
>> danijel@schiavuzzi.com> wrote:
>>
>>> I've filed a bug report for this under
>>> https://issues.apache.org/jira/browse/STORM-406
>>>
>>> The issue is 100% reproducible with, it seems, any Trident topology and
>>> across multiple Storm versions with Netty transport enabled. 0MQ is working
>>> fine. You can try with TridentWordCount from storm-starter, for example.
>>>
>>> Your insight seems correct: when the killed worker re-spawns on the same
>>> slot (port), the topology stops processing. See the above JIRA for
>>> additional info.
>>>
>>> Danijel
>>>
>>>
>>>
>>>
>>> On Thu, Jul 17, 2014 at 7:20 AM, M.Tarkeshwar Rao <
>>> tarkeshwar4u@gmail.com> wrote:
>>>
>>>> Thanks Danijel for helping me.
>>>>
>>>>
>>>> On Thu, Jul 17, 2014 at 1:37 AM, Danijel Schiavuzzi <
>>>> danijel@schiavuzzi.com> wrote:
>>>>
>>>>> I see no issues with your cluster configuration.
>>>>>
>>>>> You should definitely share the (simplified if possible) topology
>>>>> code and the steps to reproduce the blockage; better yet, you should file a
>>>>> JIRA task on Apache's JIRA site -- be sure to include your Trident
>>>>> internals modifications.
>>>>>
>>>>> Unfortunately, it seems I'm having the same issues now with Storm 0.9.2
>>>>> too, so I might get back here with some updates soon. It's not as fast
>>>>> and easily reproducible as it was under 0.9.1, but the bug
>>>>> seems nonetheless still present. I'll reduce the number of Storm slots and
>>>>> topology workers as per your insights, hopefully this might make it easier
>>>>> to reproduce the bug with a simplified Trident topology.
>>>>>
>>>>>
>>>>> On Tuesday, July 15, 2014, M.Tarkeshwar Rao <ta...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Danijel,
>>>>>>
>>>>>> We have made a few changes to the Trident core framework code as per
>>>>>> our needs, and it is working fine with ZeroMQ. I am sharing the
>>>>>> configuration we are using. Can you please check whether our config is
>>>>>> fine or not?
>>>>>>
>>>>>>  The code is quite large, so we are writing a sample topology to
>>>>>> reproduce the issue, which we will share with you.
>>>>>>
>>>>>> Steps to reproduce the issue (see the sketch after this list):
>>>>>>  -------------------------------------------------------------
>>>>>>
>>>>>> 1. We deployed our topology on one Linux machine, with two workers and
>>>>>> one acker, with batch size 2.
>>>>>> 2. Both workers come up and start processing.
>>>>>> 3. After a few seconds I killed one of the workers with kill -9.
>>>>>> 4. When the killed worker is spawned on the same port, it hangs.
>>>>>> 5. Only batch retries keep going on.
>>>>>> 6. When the killed worker is spawned on another port, everything
>>>>>> works fine.
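>>>>>>
>>>>>> A rough sketch of steps 3 and 4 from the shell (slot port 6788 is just
>>>>>> an example, and the worker main class is the 0.9.x one):
>>>>>>
>>>>>>     # the slot port appears in the worker's command line, so find it there
>>>>>>     ps aux | grep 'backtype.storm.daemon.worker' | grep -v grep | grep 6788
>>>>>>     kill -9 <pid-from-the-ps-output>
>>>>>>     # the supervisor restarts the worker; whether it lands on the same
>>>>>>     # slot port or a different one decides if the topology hangs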
>>>>>>
>>>>>> machine conf:
>>>>>> --------------------------
>>>>>> [root@sb6270x1637-2 conf]# uname -a
>>>>>>
>>>>>> Linux bl460cx2378 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10
>>>>>> 14:46:43 EST 2014 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>
>>>>>>
>>>>>> *storm.yaml* which we are using to launch  nimbus, supervisor and ui
>>>>>>
>>>>>> ########## These MUST be filled in for a storm configuration
>>>>>>  storm.zookeeper.servers:
>>>>>>      - "10.61.244.86"
>>>>>>  storm.zookeeper.port: 2000
>>>>>>  supervisor.slots.ports:
>>>>>>     - 6788
>>>>>>     - 6789
>>>>>>     - 6800
>>>>>>     - 6801
>>>>>>     - 6802
>>>>>>     - 6803
>>>>>>
>>>>>>  nimbus.host: "10.61.244.86"
>>>>>>
>>>>>>
>>>>>>  storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>>>>>>
>>>>>>  storm.messaging.netty.server_worker_threads: 10
>>>>>>  storm.messaging.netty.client_worker_threads: 10
>>>>>>  storm.messaging.netty.buffer_size: 5242880
>>>>>>  storm.messaging.netty.max_retries: 100
>>>>>>  storm.messaging.netty.max_wait_ms: 1000
>>>>>>  storm.messaging.netty.min_wait_ms: 100
>>>>>>  storm.local.dir: "/root/home_98/home/enavgoy/storm-local"
>>>>>>  storm.scheduler: "com.ericsson.storm.scheduler.TopologyScheduler"
>>>>>>  topology.acker.executors: 1
>>>>>>  topology.message.timeout.secs: 30
>>>>>>  supervisor.scheduler.meta:
>>>>>>       name: "supervisor1"
>>>>>>
>>>>>>
>>>>>>  worker.childopts: "-Xmx2048m"
>>>>>>
>>>>>>  mm.hdfs.ipaddress: "10.61.244.7"
>>>>>>  mm.hdfs.port: 9000
>>>>>>  topology.batch.size: 2
>>>>>>  topology.batch.timeout: 10000
>>>>>>  topology.workers: 2
>>>>>>  topology.debug: true
>>>>>>
>>>>>> Regards
>>>>>> Tarkeshwar
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 7, 2014 at 1:22 PM, Danijel Schiavuzzi <
>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>
>>>>>>> Hi Tarkeshwar,
>>>>>>>
>>>>>>> Could you provide a code sample of your topology? Do you have any
>>>>>>> special configs enabled?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Danijel
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <
>>>>>>> tarkeshwar4u@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Danijel,
>>>>>>>>
>>>>>>>> We are able to reproduce this issue with 0.9.2 as well.
>>>>>>>> We have a two-worker setup to run the Trident topology.
>>>>>>>>
>>>>>>>> When we kill one of the workers and the killed worker is spawned
>>>>>>>> again on the same port (same slot), that worker is not able to communicate
>>>>>>>> with the 2nd worker.
>>>>>>>>
>>>>>>>> Only the transaction attempts keep increasing continuously.
>>>>>>>>
>>>>>>>> But if the killed worker is spawned on a new slot (a new communication
>>>>>>>> port), then it works fine. Same behavior as in Storm 0.9.0.1.
>>>>>>>>
>>>>>>>> Please update me if you get any new development.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Tarkeshwar
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <
>>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Bobby,
>>>>>>>>>
>>>>>>>>> Just an update on the stuck Trident transactional topology issue
>>>>>>>>> -- I've upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and
>>>>>>>>> can't reproduce the bug anymore. Will keep you posted if any issues arise.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Danijel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>  I have not seen this before, if you could file a JIRA on this
>>>>>>>>>> that would be great.
>>>>>>>>>>
>>>>>>>>>>  - Bobby
>>>>>>>>>>
>>>>>>>>>>   From: Danijel Schiavuzzi <da...@schiavuzzi.com>
>>>>>>>>>> Reply-To: "user@storm.incubator.apache.org" <
>>>>>>>>>> user@storm.incubator.apache.org>
>>>>>>>>>> Date: Wednesday, June 4, 2014 at 10:30 AM
>>>>>>>>>> To: "user@storm.incubator.apache.org" <
>>>>>>>>>> user@storm.incubator.apache.org>, "dev@storm.incubator.apache.org"
>>>>>>>>>> <de...@storm.incubator.apache.org>
>>>>>>>>>> Subject: Trident transactional topology stuck re-emitting
>>>>>>>>>> batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)
>>>>>>>>>>
>>>>>>>>>>   Hi all,
>>>>>>>>>>
>>>>>>>>>> I've managed to reproduce the stuck topology problem and it seems
>>>>>>>>>> it's due to the Netty transport. Running with ZMQ transport enabled now and
>>>>>>>>>> I haven't been able to reproduce this.
>>>>>>>>>>
>>>>>>>>>>  The problem is basically a Trident/Kafka transactional topology
>>>>>>>>>> getting stuck, i.e. re-emitting the same batches over and over again. This
>>>>>>>>>> happens after the Storm workers restart a few times due to Kafka spout
>>>>>>>>>> throwing RuntimeExceptions (because of the Kafka consumer in the spout
>>>>>>>>>> timing out with a SocketTimeoutException due to some temporary network
>>>>>>>>>> problems). Sometimes the topology is stuck after just one worker is
>>>>>>>>>> restarted, and sometimes a few worker restarts are needed to trigger the
>>>>>>>>>> problem.
>>>>>>>>>>
>>>>>>>>>> I simulated the Kafka spout socket timeouts by blocking network
>>>>>>>>>> access from Storm to my Kafka machines (with an iptables firewall rule).
>>>>>>>>>> Most of the time the spouts (workers) would restart normally (after
>>>>>>>>>> re-enabling access to Kafka) and the topology would continue to process
>>>>>>>>>> batches, but sometimes the topology would get stuck re-emitting batches
>>>>>>>>>> after the crashed workers restarted. Killing and re-submitting the topology
>>>>>>>>>> manually fixes this always, and processing continues normally.
>>>>>>>>>>
>>>>>>>>>>  I haven't been able to reproduce this scenario after reverting
>>>>>>>>>> my Storm cluster's transport to ZeroMQ. With Netty transport, I can almost
>>>>>>>>>> always reproduce the problem by causing a worker to restart a number of
>>>>>>>>>> times (only about 4-5 worker restarts are enough to trigger this).
>>>>>>>>>>
>>>>>>>>>>  Any hints on this? Anyone had the same problem? It does seem a
>>>>>>>>>> serious issue as it affect the reliability and fault tolerance of the Storm
>>>>>>>>>> cluster.
>>>>>>>>>>
>>>>>>>>>>  In the meantime, I'll try to prepare a reproducible test case
>>>>>>>>>> for this.
>>>>>>>>>>
>>>>>>>>>>  Thanks,
>>>>>>>>>>
>>>>>>>>>> Danijel
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <
>>>>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> To (partially) answer my own question -- I still have no idea on
>>>>>>>>>>> the cause of the stuck topology, but re-submitting the topology helps --
>>>>>>>>>>> after re-submitting my topology is now running normally.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <
>>>>>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>  Also, I did have multiple cases of my IBackingMap workers
>>>>>>>>>>>> dying (because of RuntimeExceptions) but successfully restarting afterwards
>>>>>>>>>>>> (I throw RuntimeExceptions in the BackingMap implementation as my strategy
>>>>>>>>>>>> in rare SQL database deadlock situations to force a worker restart and to
>>>>>>>>>>>> fail+retry the batch).
>>>>>>>>>>>>
>>>>>>>>>>>>  From the logs, one such IBackingMap worker death (and
>>>>>>>>>>>> subsequent restart) resulted in the Kafka spout re-emitting the pending
>>>>>>>>>>>> tuple:
>>>>>>>>>>>>
>>>>>>>>>>>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO]
>>>>>>>>>>>> re-emitting batch, attempt 29698959:736
>>>>>>>>>>>>
>>>>>>>>>>>>  This is of course the normal behavior of a transactional
>>>>>>>>>>>> topology, but this is the first time I've encountered a case of a batch
>>>>>>>>>>>> retrying indefinitely. This is especially suspicious since the topology has
>>>>>>>>>>>> been running fine for 20 days straight, re-emitting batches and restarting
>>>>>>>>>>>> IBackingMap workers quite a number of times.
>>>>>>>>>>>>
>>>>>>>>>>>> I can see in my IBackingMap backing SQL database that the batch
>>>>>>>>>>>> with the exact txid value 29698959 has been committed -- but I suspect that
>>>>>>>>>>>> could come from another BackingMap, since there are two BackingMap
>>>>>>>>>>>> instances running (paralellismHint 2).
>>>>>>>>>>>>
>>>>>>>>>>>>  However, I have no idea why the batch is being retried
>>>>>>>>>>>> indefinitely now nor why it hasn't been successfully acked by Trident.
>>>>>>>>>>>>
>>>>>>>>>>>> Any suggestions on the area (topology component) to focus my
>>>>>>>>>>>> research on?
>>>>>>>>>>>>
>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <
>>>>>>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>   Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm having problems with my transactional Trident topology. It
>>>>>>>>>>>>> has been running fine for about 20 days, and suddenly is stuck processing a
>>>>>>>>>>>>> single batch, with no tuples being emitted nor tuples being persisted by
>>>>>>>>>>>>> the TridentState (IBackingMap).
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's a simple topology which consumes messages off a Kafka
>>>>>>>>>>>>> queue. The spout is an instance of storm-kafka-0.8-plus
>>>>>>>>>>>>> TransactionalTridentKafkaSpout and I use the trident-mssql transactional
>>>>>>>>>>>>> TridentState implementation to persistentAggregate() data into a SQL
>>>>>>>>>>>>> database.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> "/transactional/<myTopologyName>/coordinator/currattempts" is
>>>>>>>>>>>>> "{"29698959":6487}"
>>>>>>>>>>>>>
>>>>>>>>>>>>> ... and the attempt count keeps increasing. It seems the batch
>>>>>>>>>>>>> with txid 29698959 is stuck, as the attempt count in Zookeeper keeps
>>>>>>>>>>>>> increasing -- seems like the batch isn't being acked by Trident and I have
>>>>>>>>>>>>> no idea why, especially since the topology has been running successfully
>>>>>>>>>>>>> the last 20 days.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  I did rebalance the topology on one occasion, after which it
>>>>>>>>>>>>> continued running normally. Other than that, no other modifications were
>>>>>>>>>>>>> done. Storm is at version 0.9.0.1.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Any hints on how to debug the stuck topology? Any other
>>>>>>>>>>>>> useful info I might provide?
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>>>>>
>>>>>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>>>>>> T: +385989035562
>>>>>>>>>>>>> Skype: danijel.schiavuzzi
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>>>>
>>>>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>>>>> T: +385989035562
>>>>>>>>>>>> Skype: danijel.schiavuzzi
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>>>
>>>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>>>> T: +385989035562
>>>>>>>>>>>  Skype: danijels7
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>>
>>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>>> T: +385989035562
>>>>>>>>>> Skype: danijels7
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>
>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>> T: +385989035562
>>>>>>>>> Skype: danijels7
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Danijel Schiavuzzi
>>>>>>>
>>>>>>> E: danijel@schiavuzzi.com
>>>>>>> W: www.schiavuzzi.com
>>>>>>> T: +385989035562
>>>>>>> Skype: danijels7
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Danijel Schiavuzzi
>>>>>
>>>>> E: danijel@schiavuzzi.com
>>>>> W: www.schiavuzzi.com
>>>>> T: +385989035562
>>>>> Skype: danijels7
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Danijel Schiavuzzi
>>>
>>> E: danijel@schiavuzzi.com
>>> W: www.schiavuzzi.com
>>> T: +385989035562
>>> Skype: danijels7
>>>
>>
>>
>
> --
> Danijel Schiavuzzi
>
> E: danijel@schiavuzzi.com
> W: www.schiavuzzi.com
> T: +385 98 9035562
> Skype: danijel.schiavuzzi
>
>

Re: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Posted by Danijel Schiavuzzi <da...@schiavuzzi.com>.
Yes, it's been fixed in 'master' for some time now.

Danijel

On Tuesday, September 16, 2014, M.Tarkeshwar Rao <ta...@gmail.com>
wrote:

> Hi Danijel,
>
> Is the issue resolved in any version of the storm?
>
> Regards
> Tarkeshwar
>
> On Thu, Jul 17, 2014 at 6:57 PM, Danijel Schiavuzzi <
> danijel@schiavuzzi.com> wrote:
>
>> I've filled a bug report for this under
>> https://issues.apache.org/jira/browse/STORM-406
>>
>> The issue is 100% reproducible with, it seems, any Trident topology and
>> across multiple Storm versions with Netty transport enabled. 0MQ is working
>> fine. You can try with TridentWordCount from storm-starter, for example.
>>
>> Your insight seems correct: when the killed worker re-spawns on the same
>> slot (port), the topology stops processing. See the above JIRA for
>> additional info.
>>
>> Danijel
>>
>>
>>
>>
>> On Thu, Jul 17, 2014 at 7:20 AM, M.Tarkeshwar Rao <tarkeshwar4u@gmail.com>
>> wrote:
>>
>>> Thanks Danijel for helping me.
>>>
>>>
>>> On Thu, Jul 17, 2014 at 1:37 AM, Danijel Schiavuzzi <
>>> danijel@schiavuzzi.com> wrote:
>>>
>>>> I see no issues with your cluster configuration.
>>>>
>>>> You should definitely share the (simplified if possible) topology
>>>> code and the steps to reproduce the blockage, better yet you should file a
>>>> JIRA task on Apache's JIRA web -- be sure to include your Trident
>>>> internals modifications.
>>>>
>>>> Unfortunately, seems I'm having the same issues now with Storm 0.9.2
>>>> too, so I might get back here with some updates soon. It's not so fast
>>>> and easily reproducible as it was under 0.9.1, but the bug
>>>> seems nonetheless still present. I'll reduce the number of Storm slots and
>>>> topology workers as per your insights, hopefully this might make it easier
>>>> to reproduce the bug with a simplified Trident topology.
>>>>
>>>>
>>>> On Tuesday, July 15, 2014, M.Tarkeshwar Rao <tarkeshwar4u@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Denijel,
>>>>>
>>>>> We have done few changes in the the trident core framework code as per
>>>>> our need which is working fine with zeromq. I am sharing configuration
>>>>> which we are using. Can you please suggest our config is fine or not?
>>>>>
>>>>>  Code part is so large so we are writing some sample topology and
>>>>> trying to reproduce the issue, which we will share with you.
>>>>>
>>>>> What are the steps to reproduce the issue:
>>>>>  -------------------------------------------------------------
>>>>>
>>>>> 1. we deployed our topology with one linux machine, two workers and
>>>>> one acker with batch size 2.
>>>>> 2. both the worker are up and start the processing.
>>>>> 3. after few seconds i killed one of the worker kill -9.
>>>>> 4. when the killed worker spawned on the same port it is getting
>>>>> hanged.
>>>>> 5. only retries going on.
>>>>> 6. when the killed worker spawned on the another port everything
>>>>> working fine.
>>>>>
>>>>> machine conf:
>>>>> --------------------------
>>>>> [root@sb6270x1637-2 conf]# uname -a
>>>>>
>>>>> Linux bl460cx2378 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10 14:46:43
>>>>> EST 2014 x86_64 x86_64 x86_64 GNU/Linux
>>>>>
>>>>>
>>>>> *storm.yaml* which we are using to launch  nimbus, supervisor and ui
>>>>>
>>>>> ########## These MUST be filled in for a storm configuration
>>>>>  storm.zookeeper.servers:
>>>>>      - "10.61.244.86"
>>>>>  storm.zookeeper.port: 2000
>>>>>  supervisor.slots.ports:
>>>>>     - 6788
>>>>>     - 6789
>>>>>     - 6800
>>>>>     - 6801
>>>>>     - 6802
>>>>>      - 6803
>>>>>
>>>>>  nimbus.host: "10.61.244.86"
>>>>>
>>>>>
>>>>>  storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>>>>>
>>>>>  storm.messaging.netty.server_worker_threads: 10
>>>>>  storm.messaging.netty.client_worker_threads: 10
>>>>>  storm.messaging.netty.buffer_size: 5242880
>>>>>  storm.messaging.netty.max_retries: 100
>>>>>  storm.messaging.netty.max_wait_ms: 1000
>>>>>  storm.messaging.netty.min_wait_ms: 100
>>>>>  storm.local.dir: "/root/home_98/home/enavgoy/storm-local"
>>>>>  storm.scheduler: "com.ericsson.storm.scheduler.TopologyScheduler"
>>>>>  topology.acker.executors: 1
>>>>>  topology.message.timeout.secs: 30
>>>>>  supervisor.scheduler.meta:
>>>>>       name: "supervisor1"
>>>>>
>>>>>
>>>>>  worker.childopts: "-Xmx2048m"
>>>>>
>>>>>  mm.hdfs.ipaddress: "10.61.244.7"
>>>>>  mm.hdfs.port: 9000
>>>>>  topology.batch.size: 2
>>>>>  topology.batch.timeout: 10000
>>>>>  topology.workers: 2
>>>>>  topology.debug: true
>>>>>
>>>>> Regards
>>>>> Tarkeshwar
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jul 7, 2014 at 1:22 PM, Danijel Schiavuzzi <
>>>>> danijel@schiavuzzi.com> wrote:
>>>>>
>>>>>> Hi Tarkeshwar,
>>>>>>
>>>>>> Could you provide a code sample of your topology? Do you have any
>>>>>> special configs enabled?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Danijel
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <
>>>>>> tarkeshwar4u@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Danijel,
>>>>>>>
>>>>>>> We are able to reproduce this issue with 0.9.2 as well.
>>>>>>> We have two worker setup to run the trident topology.
>>>>>>>
>>>>>>> When we kill one of the worker and again when that killed worker
>>>>>>> spawn on same port(same slot) then that worker not able to communicate with
>>>>>>> 2nd worker.
>>>>>>>
>>>>>>> only transaction attempts are increasing continuously.
>>>>>>>
>>>>>>> But if the killed worker spawn on new slot(new communication port)
>>>>>>> then it working fine. Same behavior as in storm 9.0.1.
>>>>>>>
>>>>>>> Please update me if you get any new development.
>>>>>>>
>>>>>>> Regards
>>>>>>> Tarkeshwar
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <
>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>
>>>>>>>> Hi Bobby,
>>>>>>>>
>>>>>>>> Just an update on the stuck Trident transactional topology issue --
>>>>>>>> I've upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't
>>>>>>>> reproduce the bug anymore. Will keep you posted if any issues arise.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Danijel
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>  I have not seen this before, if you could file a JIRA on this
>>>>>>>>> that would be great.
>>>>>>>>>
>>>>>>>>>  - Bobby
>>>>>>>>>
>>>>>>>>>   From: Danijel Schiavuzzi <da...@schiavuzzi.com>
>>>>>>>>> Reply-To: "user@storm.incubator.apache.org" <
>>>>>>>>> user@storm.incubator.apache.org>
>>>>>>>>> Date: Wednesday, June 4, 2014 at 10:30 AM
>>>>>>>>> To: "user@storm.incubator.apache.org" <
>>>>>>>>> user@storm.incubator.apache.org>, "dev@storm.incubator.apache.org"
>>>>>>>>> <de...@storm.incubator.apache.org>
>>>>>>>>> Subject: Trident transactional topology stuck re-emitting batches
>>>>>>>>> with Netty, but running fine with ZMQ (was Re: Topology is stuck)
>>>>>>>>>
>>>>>>>>>   Hi all,
>>>>>>>>>
>>>>>>>>> I've managed to reproduce the stuck topology problem and it seems
>>>>>>>>> it's due to the Netty transport. Running with ZMQ transport enabled now and
>>>>>>>>> I haven't been able to reproduce this.
>>>>>>>>>
>>>>>>>>>  The problem is basically a Trident/Kafka transactional topology
>>>>>>>>> getting stuck, i.e. re-emitting the same batches over and over again. This
>>>>>>>>> happens after the Storm workers restart a few times due to Kafka spout
>>>>>>>>> throwing RuntimeExceptions (because of the Kafka consumer in the spout
>>>>>>>>> timing out with a SocketTimeoutException due to some temporary network
>>>>>>>>> problems). Sometimes the topology is stuck after just one worker is
>>>>>>>>> restarted, and sometimes a few worker restarts are needed to trigger the
>>>>>>>>> problem.
>>>>>>>>>
>>>>>>>>> I simulated the Kafka spout socket timeouts by blocking network
>>>>>>>>> access from Storm to my Kafka machines (with an iptables firewall rule).
>>>>>>>>> Most of the time the spouts (workers) would restart normally (after
>>>>>>>>> re-enabling access to Kafka) and the topology would continue to process
>>>>>>>>> batches, but sometimes the topology would get stuck re-emitting batches
>>>>>>>>> after the crashed workers restarted. Killing and re-submitting the topology
>>>>>>>>> manually fixes this always, and processing continues normally.
>>>>>>>>>
>>>>>>>>>  I haven't been able to reproduce this scenario after reverting
>>>>>>>>> my Storm cluster's transport to ZeroMQ. With Netty transport, I can almost
>>>>>>>>> always reproduce the problem by causing a worker to restart a number of
>>>>>>>>> times (only about 4-5 worker restarts are enough to trigger this).
>>>>>>>>>
>>>>>>>>>  Any hints on this? Anyone had the same problem? It does seem a
>>>>>>>>> serious issue as it affect the reliability and fault tolerance of the Storm
>>>>>>>>> cluster.
>>>>>>>>>
>>>>>>>>>  In the meantime, I'll try to prepare a reproducible test case
>>>>>>>>> for this.
>>>>>>>>>
>>>>>>>>>  Thanks,
>>>>>>>>>
>>>>>>>>> Danijel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <
>>>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>>>
>>>>>>>>>> To (partially) answer my own question -- I still have no idea on
>>>>>>>>>> the cause of the stuck topology, but re-submitting the topology helps --
>>>>>>>>>> after re-submitting my topology is now running normally.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <
>>>>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>  Also, I did have multiple cases of my IBackingMap workers
>>>>>>>>>>> dying (because of RuntimeExceptions) but successfully restarting afterwards
>>>>>>>>>>> (I throw RuntimeExceptions in the BackingMap implementation as my strategy
>>>>>>>>>>> in rare SQL database deadlock situations to force a worker restart and to
>>>>>>>>>>> fail+retry the batch).
>>>>>>>>>>>
>>>>>>>>>>>  From the logs, one such IBackingMap worker death (and
>>>>>>>>>>> subsequent restart) resulted in the Kafka spout re-emitting the pending
>>>>>>>>>>> tuple:
>>>>>>>>>>>
>>>>>>>>>>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO]
>>>>>>>>>>> re-emitting batch, attempt 29698959:736
>>>>>>>>>>>
>>>>>>>>>>>  This is of course the normal behavior of a transactional
>>>>>>>>>>> topology, but this is the first time I've encountered a case of a batch
>>>>>>>>>>> retrying indefinitely. This is especially suspicious since the topology has
>>>>>>>>>>> been running fine for 20 days straight, re-emitting batches and restarting
>>>>>>>>>>> IBackingMap workers quite a number of times.
>>>>>>>>>>>
>>>>>>>>>>> I can see in my IBackingMap backing SQL database that the batch
>>>>>>>>>>> with the exact txid value 29698959 has been committed -- but I suspect that
>>>>>>>>>>> could come from another BackingMap, since there are two BackingMap
>>>>>>>>>>> instances running (paralellismHint 2).
>>>>>>>>>>>
>>>>>>>>>>>  However, I have no idea why the batch is being retried
>>>>>>>>>>> indefinitely now nor why it hasn't been successfully acked by Trident.
>>>>>>>>>>>
>>>>>>>>>>> Any suggestions on the area (topology component) to focus my
>>>>>>>>>>> research on?
>>>>>>>>>>>
>>>>>>>>>>>  Thanks,
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <
>>>>>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>   Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm having problems with my transactional Trident topology. It
>>>>>>>>>>>> has been running fine for about 20 days, and suddenly is stuck processing a
>>>>>>>>>>>> single batch, with no tuples being emitted nor tuples being persisted by
>>>>>>>>>>>> the TridentState (IBackingMap).
>>>>>>>>>>>>
>>>>>>>>>>>> It's a simple topology which consumes messages off a Kafka
>>>>>>>>>>>> queue. The spout is an instance of storm-kafka-0.8-plus
>>>>>>>>>>>> TransactionalTridentKafkaSpout and I use the trident-mssql transactional
>>>>>>>>>>>> TridentState implementation to persistentAggregate() data into a SQL
>>>>>>>>>>>> database.
>>>>>>>>>>>>
>>>>>>>>>>>>  In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>>>>>>>>>
>>>>>>>>>>>>      "/transactional/<myTopologyName>/coordinator/currattempts"
>>>>>>>>>>>> is "{"29698959":6487}"
>>>>>>>>>>>>
>>>>>>>>>>>> ... and the attempt count keeps increasing. It seems the batch
>>>>>>>>>>>> with txid 29698959 is stuck, as the attempt count in Zookeeper keeps
>>>>>>>>>>>> increasing -- seems like the batch isn't being acked by Trident and I have
>>>>>>>>>>>> no idea why, especially since the topology has been running successfully
>>>>>>>>>>>> the last 20 days.
>>>>>>>>>>>>
>>>>>>>>>>>>  I did rebalance the topology on one occasion, after which it
>>>>>>>>>>>> continued running normally. Other than that, no other modifications were
>>>>>>>>>>>> done. Storm is at version 0.9.0.1.
>>>>>>>>>>>>
>>>>>>>>>>>>  Any hints on how to debug the stuck topology? Any other
>>>>>>>>>>>> useful info I might provide?
>>>>>>>>>>>>
>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>>>>
>>>>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>>>>> T: +385989035562
>>>>>>>>>>>> Skype: danijel.schiavuzzi
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>>>
>>>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>>>> T: +385989035562
>>>>>>>>>>> Skype: danijel.schiavuzzi
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>>
>>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>>> T: +385989035562
>>>>>>>>>>  Skype: danijels7
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>
>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>> T: +385989035562
>>>>>>>>> Skype: danijels7
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Danijel Schiavuzzi
>>>>>>>>
>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>> W: www.schiavuzzi.com
>>>>>>>> T: +385989035562
>>>>>>>> Skype: danijels7
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Danijel Schiavuzzi
>>>>>>
>>>>>> E: danijel@schiavuzzi.com
>>>>>> W: www.schiavuzzi.com
>>>>>> T: +385989035562
>>>>>> Skype: danijels7
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Danijel Schiavuzzi
>>>>
>>>> E: danijel@schiavuzzi.com
>>>> W: www.schiavuzzi.com
>>>> T: +385989035562
>>>> Skype: danijels7
>>>>
>>>
>>>
>>
>>
>> --
>> Danijel Schiavuzzi
>>
>> E: danijel@schiavuzzi.com
>> W: www.schiavuzzi.com
>> T: +385989035562
>> Skype: danijels7
>>
>
>

-- 
Danijel Schiavuzzi

E: danijel@schiavuzzi.com
W: www.schiavuzzi.com
T: +385 98 9035562
Skype: danijel.schiavuzzi

Re: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Posted by "M.Tarkeshwar Rao" <ta...@gmail.com>.
Hi Danijel,

Is the issue resolved in any version of Storm?

Regards
Tarkeshwar

On Thu, Jul 17, 2014 at 6:57 PM, Danijel Schiavuzzi <da...@schiavuzzi.com>
wrote:

> I've filled a bug report for this under
> https://issues.apache.org/jira/browse/STORM-406
>
> The issue is 100% reproducible with, it seems, any Trident topology and
> across multiple Storm versions with Netty transport enabled. 0MQ is working
> fine. You can try with TridentWordCount from storm-starter, for example.
>
> Your insight seems correct: when the killed worker re-spawns on the same
> slot (port), the topology stops processing. See the above JIRA for
> additional info.
>
> Danijel
>
>
>
>
> On Thu, Jul 17, 2014 at 7:20 AM, M.Tarkeshwar Rao <ta...@gmail.com>
> wrote:
>
>> Thanks Danijel for helping me.
>>
>>
>> On Thu, Jul 17, 2014 at 1:37 AM, Danijel Schiavuzzi <
>> danijel@schiavuzzi.com> wrote:
>>
>>> I see no issues with your cluster configuration.
>>>
>>> You should definitely share the (simplified if possible) topology
>>> code and the steps to reproduce the blockage, better yet you should file a
>>> JIRA task on Apache's JIRA web -- be sure to include your Trident
>>> internals modifications.
>>>
>>> Unfortunately, seems I'm having the same issues now with Storm 0.9.2
>>> too, so I might get back here with some updates soon. It's not so fast
>>> and easily reproducible as it was under 0.9.1, but the bug
>>> seems nonetheless still present. I'll reduce the number of Storm slots and
>>> topology workers as per your insights, hopefully this might make it easier
>>> to reproduce the bug with a simplified Trident topology.
>>>
>>>
>>> On Tuesday, July 15, 2014, M.Tarkeshwar Rao <ta...@gmail.com>
>>> wrote:
>>>
>>>> Hi Denijel,
>>>>
>>>> We have done few changes in the the trident core framework code as per
>>>> our need which is working fine with zeromq. I am sharing configuration
>>>> which we are using. Can you please suggest our config is fine or not?
>>>>
>>>>  Code part is so large so we are writing some sample topology and
>>>> trying to reproduce the issue, which we will share with you.
>>>>
>>>> What are the steps to reproduce the issue:
>>>>  -------------------------------------------------------------
>>>>
>>>> 1. we deployed our topology with one linux machine, two workers and one
>>>> acker with batch size 2.
>>>> 2. both the worker are up and start the processing.
>>>> 3. after few seconds i killed one of the worker kill -9.
>>>> 4. when the killed worker spawned on the same port it is getting hanged.
>>>> 5. only retries going on.
>>>> 6. when the killed worker spawned on the another port everything
>>>> working fine.
>>>>
>>>> machine conf:
>>>> --------------------------
>>>> [root@sb6270x1637-2 conf]# uname -a
>>>>
>>>> Linux bl460cx2378 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10 14:46:43
>>>> EST 2014 x86_64 x86_64 x86_64 GNU/Linux
>>>>
>>>>
>>>> *storm.yaml* which we are using to launch  nimbus, supervisor and ui
>>>>
>>>> ########## These MUST be filled in for a storm configuration
>>>>  storm.zookeeper.servers:
>>>>      - "10.61.244.86"
>>>>  storm.zookeeper.port: 2000
>>>>  supervisor.slots.ports:
>>>>     - 6788
>>>>     - 6789
>>>>     - 6800
>>>>     - 6801
>>>>     - 6802
>>>>      - 6803
>>>>
>>>>  nimbus.host: "10.61.244.86"
>>>>
>>>>
>>>>  storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>>>>
>>>>  storm.messaging.netty.server_worker_threads: 10
>>>>  storm.messaging.netty.client_worker_threads: 10
>>>>  storm.messaging.netty.buffer_size: 5242880
>>>>  storm.messaging.netty.max_retries: 100
>>>>  storm.messaging.netty.max_wait_ms: 1000
>>>>  storm.messaging.netty.min_wait_ms: 100
>>>>  storm.local.dir: "/root/home_98/home/enavgoy/storm-local"
>>>>  storm.scheduler: "com.ericsson.storm.scheduler.TopologyScheduler"
>>>>  topology.acker.executors: 1
>>>>  topology.message.timeout.secs: 30
>>>>  supervisor.scheduler.meta:
>>>>       name: "supervisor1"
>>>>
>>>>
>>>>  worker.childopts: "-Xmx2048m"
>>>>
>>>>  mm.hdfs.ipaddress: "10.61.244.7"
>>>>  mm.hdfs.port: 9000
>>>>  topology.batch.size: 2
>>>>  topology.batch.timeout: 10000
>>>>  topology.workers: 2
>>>>  topology.debug: true
>>>>
>>>> Regards
>>>> Tarkeshwar
>>>>
>>>>
>>>>
>>>> On Mon, Jul 7, 2014 at 1:22 PM, Danijel Schiavuzzi <
>>>> danijel@schiavuzzi.com> wrote:
>>>>
>>>>> Hi Tarkeshwar,
>>>>>
>>>>> Could you provide a code sample of your topology? Do you have any
>>>>> special configs enabled?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Danijel
>>>>>
>>>>>
>>>>> On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <
>>>>> tarkeshwar4u@gmail.com> wrote:
>>>>>
>>>>>> Hi Danijel,
>>>>>>
>>>>>> We are able to reproduce this issue with 0.9.2 as well.
>>>>>> We have two worker setup to run the trident topology.
>>>>>>
>>>>>> When we kill one of the worker and again when that killed worker
>>>>>> spawn on same port(same slot) then that worker not able to communicate with
>>>>>> 2nd worker.
>>>>>>
>>>>>> only transaction attempts are increasing continuously.
>>>>>>
>>>>>> But if the killed worker spawn on new slot(new communication port)
>>>>>> then it working fine. Same behavior as in storm 9.0.1.
>>>>>>
>>>>>> Please update me if you get any new development.
>>>>>>
>>>>>> Regards
>>>>>> Tarkeshwar
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <
>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>
>>>>>>> Hi Bobby,
>>>>>>>
>>>>>>> Just an update on the stuck Trident transactional topology issue --
>>>>>>> I've upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't
>>>>>>> reproduce the bug anymore. Will keep you posted if any issues arise.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Danijel
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>  I have not seen this before, if you could file a JIRA on this
>>>>>>>> that would be great.
>>>>>>>>
>>>>>>>>  - Bobby
>>>>>>>>
>>>>>>>>   From: Danijel Schiavuzzi <da...@schiavuzzi.com>
>>>>>>>> Reply-To: "user@storm.incubator.apache.org" <
>>>>>>>> user@storm.incubator.apache.org>
>>>>>>>> Date: Wednesday, June 4, 2014 at 10:30 AM
>>>>>>>> To: "user@storm.incubator.apache.org" <
>>>>>>>> user@storm.incubator.apache.org>, "dev@storm.incubator.apache.org"
>>>>>>>> <de...@storm.incubator.apache.org>
>>>>>>>> Subject: Trident transactional topology stuck re-emitting batches
>>>>>>>> with Netty, but running fine with ZMQ (was Re: Topology is stuck)
>>>>>>>>
>>>>>>>>   Hi all,
>>>>>>>>
>>>>>>>> I've managed to reproduce the stuck topology problem and it seems
>>>>>>>> it's due to the Netty transport. Running with ZMQ transport enabled now and
>>>>>>>> I haven't been able to reproduce this.
>>>>>>>>
>>>>>>>>  The problem is basically a Trident/Kafka transactional topology
>>>>>>>> getting stuck, i.e. re-emitting the same batches over and over again. This
>>>>>>>> happens after the Storm workers restart a few times due to Kafka spout
>>>>>>>> throwing RuntimeExceptions (because of the Kafka consumer in the spout
>>>>>>>> timing out with a SocketTimeoutException due to some temporary network
>>>>>>>> problems). Sometimes the topology is stuck after just one worker is
>>>>>>>> restarted, and sometimes a few worker restarts are needed to trigger the
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> I simulated the Kafka spout socket timeouts by blocking network
>>>>>>>> access from Storm to my Kafka machines (with an iptables firewall rule).
>>>>>>>> Most of the time the spouts (workers) would restart normally (after
>>>>>>>> re-enabling access to Kafka) and the topology would continue to process
>>>>>>>> batches, but sometimes the topology would get stuck re-emitting batches
>>>>>>>> after the crashed workers restarted. Killing and re-submitting the topology
>>>>>>>> manually fixes this always, and processing continues normally.
>>>>>>>>
>>>>>>>>  I haven't been able to reproduce this scenario after reverting my
>>>>>>>> Storm cluster's transport to ZeroMQ. With Netty transport, I can almost
>>>>>>>> always reproduce the problem by causing a worker to restart a number of
>>>>>>>> times (only about 4-5 worker restarts are enough to trigger this).
>>>>>>>>
>>>>>>>>  Any hints on this? Anyone had the same problem? It does seem a
>>>>>>>> serious issue as it affect the reliability and fault tolerance of the Storm
>>>>>>>> cluster.
>>>>>>>>
>>>>>>>>  In the meantime, I'll try to prepare a reproducible test case for
>>>>>>>> this.
>>>>>>>>
>>>>>>>>  Thanks,
>>>>>>>>
>>>>>>>> Danijel
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <
>>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>>
>>>>>>>>> To (partially) answer my own question -- I still have no idea on
>>>>>>>>> the cause of the stuck topology, but re-submitting the topology helps --
>>>>>>>>> after re-submitting my topology is now running normally.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <
>>>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>>>
>>>>>>>>>>  Also, I did have multiple cases of my IBackingMap workers dying
>>>>>>>>>> (because of RuntimeExceptions) but successfully restarting afterwards (I
>>>>>>>>>> throw RuntimeExceptions in the BackingMap implementation as my strategy in
>>>>>>>>>> rare SQL database deadlock situations to force a worker restart and to
>>>>>>>>>> fail+retry the batch).
>>>>>>>>>>
>>>>>>>>>>  From the logs, one such IBackingMap worker death (and
>>>>>>>>>> subsequent restart) resulted in the Kafka spout re-emitting the pending
>>>>>>>>>> tuple:
>>>>>>>>>>
>>>>>>>>>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO]
>>>>>>>>>> re-emitting batch, attempt 29698959:736
>>>>>>>>>>
>>>>>>>>>>  This is of course the normal behavior of a transactional
>>>>>>>>>> topology, but this is the first time I've encountered a case of a batch
>>>>>>>>>> retrying indefinitely. This is especially suspicious since the topology has
>>>>>>>>>> been running fine for 20 days straight, re-emitting batches and restarting
>>>>>>>>>> IBackingMap workers quite a number of times.
>>>>>>>>>>
>>>>>>>>>> I can see in my IBackingMap backing SQL database that the batch
>>>>>>>>>> with the exact txid value 29698959 has been committed -- but I suspect that
>>>>>>>>>> could come from another BackingMap, since there are two BackingMap
>>>>>>>>>> instances running (paralellismHint 2).
>>>>>>>>>>
>>>>>>>>>>  However, I have no idea why the batch is being retried
>>>>>>>>>> indefinitely now nor why it hasn't been successfully acked by Trident.
>>>>>>>>>>
>>>>>>>>>> Any suggestions on the area (topology component) to focus my
>>>>>>>>>> research on?
>>>>>>>>>>
>>>>>>>>>>  Thanks,
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <
>>>>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>   Hello,
>>>>>>>>>>>
>>>>>>>>>>> I'm having problems with my transactional Trident topology. It
>>>>>>>>>>> has been running fine for about 20 days, and suddenly is stuck processing a
>>>>>>>>>>> single batch, with no tuples being emitted nor tuples being persisted by
>>>>>>>>>>> the TridentState (IBackingMap).
>>>>>>>>>>>
>>>>>>>>>>> It's a simple topology which consumes messages off a Kafka
>>>>>>>>>>> queue. The spout is an instance of storm-kafka-0.8-plus
>>>>>>>>>>> TransactionalTridentKafkaSpout and I use the trident-mssql transactional
>>>>>>>>>>> TridentState implementation to persistentAggregate() data into a SQL
>>>>>>>>>>> database.
>>>>>>>>>>>
>>>>>>>>>>>  In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>>>>>>>>
>>>>>>>>>>>      "/transactional/<myTopologyName>/coordinator/currattempts"
>>>>>>>>>>> is "{"29698959":6487}"
>>>>>>>>>>>
>>>>>>>>>>> ... and the attempt count keeps increasing. It seems the batch
>>>>>>>>>>> with txid 29698959 is stuck, as the attempt count in Zookeeper keeps
>>>>>>>>>>> increasing -- seems like the batch isn't being acked by Trident and I have
>>>>>>>>>>> no idea why, especially since the topology has been running successfully
>>>>>>>>>>> the last 20 days.
>>>>>>>>>>>
>>>>>>>>>>>  I did rebalance the topology on one occasion, after which it
>>>>>>>>>>> continued running normally. Other than that, no other modifications were
>>>>>>>>>>> done. Storm is at version 0.9.0.1.
>>>>>>>>>>>
>>>>>>>>>>>  Any hints on how to debug the stuck topology? Any other useful
>>>>>>>>>>> info I might provide?
>>>>>>>>>>>
>>>>>>>>>>>  Thanks,
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>>>
>>>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>>>> T: +385989035562
>>>>>>>>>>> Skype: danijel.schiavuzzi
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>>
>>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>>> T: +385989035562
>>>>>>>>>> Skype: danijel.schiavuzzi
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>
>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>> T: +385989035562
>>>>>>>>>  Skype: danijels7
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Danijel Schiavuzzi
>>>>>>>>
>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>> W: www.schiavuzzi.com
>>>>>>>> T: +385989035562
>>>>>>>> Skype: danijels7
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Danijel Schiavuzzi
>>>>>>>
>>>>>>> E: danijel@schiavuzzi.com
>>>>>>> W: www.schiavuzzi.com
>>>>>>> T: +385989035562
>>>>>>> Skype: danijels7
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Danijel Schiavuzzi
>>>>>
>>>>> E: danijel@schiavuzzi.com
>>>>> W: www.schiavuzzi.com
>>>>> T: +385989035562
>>>>> Skype: danijels7
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Danijel Schiavuzzi
>>>
>>> E: danijel@schiavuzzi.com
>>> W: www.schiavuzzi.com
>>> T: +385989035562
>>> Skype: danijels7
>>>
>>
>>
>
>
> --
> Danijel Schiavuzzi
>
> E: danijel@schiavuzzi.com
> W: www.schiavuzzi.com
> T: +385989035562
> Skype: danijels7
>

Re: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Posted by Danijel Schiavuzzi <da...@schiavuzzi.com>.
I've filed a bug report for this under
https://issues.apache.org/jira/browse/STORM-406

The issue is 100% reproducible with, it seems, any Trident topology and
across multiple Storm versions with Netty transport enabled. 0MQ is working
fine. You can try with TridentWordCount from storm-starter, for example.

Your insight seems correct: when the killed worker re-spawns on the same
slot (port), the topology stops processing. See the above JIRA for
additional info.
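
For anyone who wants to try it, the reproduction boils down to something
like this (the jar name is whatever your storm-starter build produces, and
the class path is the one from the 0.9.x storm-starter):

    # submit the stock Trident word count topology
    storm jar storm-starter-<version>-jar-with-dependencies.jar \
        storm.starter.trident.TridentWordCount wordcount
    # find one of its workers and kill it hard
    ps aux | grep 'backtype.storm.daemon.worker' | grep -v grep
    kill -9 <worker-pid>
    # if the supervisor brings the worker back up on the *same* slot port,
    # the topology stops committing batches and only re-emits attempts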

Danijel




On Thu, Jul 17, 2014 at 7:20 AM, M.Tarkeshwar Rao <ta...@gmail.com>
wrote:

> Thanks Danijel for helping me.
>
>
> On Thu, Jul 17, 2014 at 1:37 AM, Danijel Schiavuzzi <
> danijel@schiavuzzi.com> wrote:
>
>> I see no issues with your cluster configuration.
>>
>> You should definitely share the (simplified if possible) topology
>> code and the steps to reproduce the blockage, better yet you should file a
>> JIRA task on Apache's JIRA web -- be sure to include your Trident
>> internals modifications.
>>
>> Unfortunately, seems I'm having the same issues now with Storm 0.9.2
>> too, so I might get back here with some updates soon. It's not so fast
>> and easily reproducible as it was under 0.9.1, but the bug
>> seems nonetheless still present. I'll reduce the number of Storm slots and
>> topology workers as per your insights, hopefully this might make it easier
>> to reproduce the bug with a simplified Trident topology.
>>
>>
>> On Tuesday, July 15, 2014, M.Tarkeshwar Rao <ta...@gmail.com>
>> wrote:
>>
>>> Hi Denijel,
>>>
>>> We have done few changes in the the trident core framework code as per
>>> our need which is working fine with zeromq. I am sharing configuration
>>> which we are using. Can you please suggest our config is fine or not?
>>>
>>>  Code part is so large so we are writing some sample topology and trying
>>> to reproduce the issue, which we will share with you.
>>>
>>> What are the steps to reproduce the issue:
>>>  -------------------------------------------------------------
>>>
>>> 1. we deployed our topology with one linux machine, two workers and one
>>> acker with batch size 2.
>>> 2. both the worker are up and start the processing.
>>> 3. after few seconds i killed one of the worker kill -9.
>>> 4. when the killed worker spawned on the same port it is getting hanged.
>>> 5. only retries going on.
>>> 6. when the killed worker spawned on the another port everything working
>>> fine.
>>>
>>> machine conf:
>>> --------------------------
>>> [root@sb6270x1637-2 conf]# uname -a
>>>
>>> Linux bl460cx2378 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10 14:46:43
>>> EST 2014 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>
>>> *storm.yaml* which we are using to launch  nimbus, supervisor and ui
>>>
>>> ########## These MUST be filled in for a storm configuration
>>>  storm.zookeeper.servers:
>>>      - "10.61.244.86"
>>>  storm.zookeeper.port: 2000
>>>  supervisor.slots.ports:
>>>     - 6788
>>>     - 6789
>>>     - 6800
>>>     - 6801
>>>     - 6802
>>>      - 6803
>>>
>>>  nimbus.host: "10.61.244.86"
>>>
>>>
>>>  storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>>>
>>>  storm.messaging.netty.server_worker_threads: 10
>>>  storm.messaging.netty.client_worker_threads: 10
>>>  storm.messaging.netty.buffer_size: 5242880
>>>  storm.messaging.netty.max_retries: 100
>>>  storm.messaging.netty.max_wait_ms: 1000
>>>  storm.messaging.netty.min_wait_ms: 100
>>>  storm.local.dir: "/root/home_98/home/enavgoy/storm-local"
>>>  storm.scheduler: "com.ericsson.storm.scheduler.TopologyScheduler"
>>>  topology.acker.executors: 1
>>>  topology.message.timeout.secs: 30
>>>  supervisor.scheduler.meta:
>>>       name: "supervisor1"
>>>
>>>
>>>  worker.childopts: "-Xmx2048m"
>>>
>>>  mm.hdfs.ipaddress: "10.61.244.7"
>>>  mm.hdfs.port: 9000
>>>  topology.batch.size: 2
>>>  topology.batch.timeout: 10000
>>>  topology.workers: 2
>>>  topology.debug: true
>>>
>>> Regards
>>> Tarkeshwar
>>>
>>>
>>>
>>> On Mon, Jul 7, 2014 at 1:22 PM, Danijel Schiavuzzi <
>>> danijel@schiavuzzi.com> wrote:
>>>
>>>> Hi Tarkeshwar,
>>>>
>>>> Could you provide a code sample of your topology? Do you have any
>>>> special configs enabled?
>>>>
>>>> Thanks,
>>>>
>>>> Danijel
>>>>
>>>>
>>>> On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <
>>>> tarkeshwar4u@gmail.com> wrote:
>>>>
>>>>> Hi Danijel,
>>>>>
>>>>> We are able to reproduce this issue with 0.9.2 as well.
>>>>> We have two worker setup to run the trident topology.
>>>>>
>>>>> When we kill one of the worker and again when that killed worker spawn
>>>>> on same port(same slot) then that worker not able to communicate with 2nd
>>>>> worker.
>>>>>
>>>>> only transaction attempts are increasing continuously.
>>>>>
>>>>> But if the killed worker spawn on new slot(new communication port)
>>>>> then it working fine. Same behavior as in storm 9.0.1.
>>>>>
>>>>> Please update me if you get any new development.
>>>>>
>>>>> Regards
>>>>> Tarkeshwar
>>>>>
>>>>>
>>>>> On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <
>>>>> danijel@schiavuzzi.com> wrote:
>>>>>
>>>>>> Hi Bobby,
>>>>>>
>>>>>> Just an update on the stuck Trident transactional topology issue --
>>>>>> I've upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't
>>>>>> reproduce the bug anymore. Will keep you posted if any issues arise.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Danijel
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com>
>>>>>> wrote:
>>>>>>
>>>>>>>  I have not seen this before, if you could file a JIRA on this that
>>>>>>> would be great.
>>>>>>>
>>>>>>>  - Bobby
>>>>>>>
>>>>>>>   From: Danijel Schiavuzzi <da...@schiavuzzi.com>
>>>>>>> Reply-To: "user@storm.incubator.apache.org" <
>>>>>>> user@storm.incubator.apache.org>
>>>>>>> Date: Wednesday, June 4, 2014 at 10:30 AM
>>>>>>> To: "user@storm.incubator.apache.org" <
>>>>>>> user@storm.incubator.apache.org>, "dev@storm.incubator.apache.org" <
>>>>>>> dev@storm.incubator.apache.org>
>>>>>>> Subject: Trident transactional topology stuck re-emitting batches
>>>>>>> with Netty, but running fine with ZMQ (was Re: Topology is stuck)
>>>>>>>
>>>>>>>   Hi all,
>>>>>>>
>>>>>>> I've managed to reproduce the stuck topology problem and it seems
>>>>>>> it's due to the Netty transport. Running with ZMQ transport enabled now and
>>>>>>> I haven't been able to reproduce this.
>>>>>>>
>>>>>>>  The problem is basically a Trident/Kafka transactional topology
>>>>>>> getting stuck, i.e. re-emitting the same batches over and over again. This
>>>>>>> happens after the Storm workers restart a few times due to Kafka spout
>>>>>>> throwing RuntimeExceptions (because of the Kafka consumer in the spout
>>>>>>> timing out with a SocketTimeoutException due to some temporary network
>>>>>>> problems). Sometimes the topology is stuck after just one worker is
>>>>>>> restarted, and sometimes a few worker restarts are needed to trigger the
>>>>>>> problem.
>>>>>>>
>>>>>>> I simulated the Kafka spout socket timeouts by blocking network
>>>>>>> access from Storm to my Kafka machines (with an iptables firewall rule).
>>>>>>> Most of the time the spouts (workers) would restart normally (after
>>>>>>> re-enabling access to Kafka) and the topology would continue to process
>>>>>>> batches, but sometimes the topology would get stuck re-emitting batches
>>>>>>> after the crashed workers restarted. Killing and re-submitting the topology
>>>>>>> manually fixes this always, and processing continues normally.
>>>>>>>
>>>>>>>  I haven't been able to reproduce this scenario after reverting my
>>>>>>> Storm cluster's transport to ZeroMQ. With Netty transport, I can almost
>>>>>>> always reproduce the problem by causing a worker to restart a number of
>>>>>>> times (only about 4-5 worker restarts are enough to trigger this).
>>>>>>>
>>>>>>>  Any hints on this? Anyone had the same problem? It does seem a
>>>>>>> serious issue as it affect the reliability and fault tolerance of the Storm
>>>>>>> cluster.
>>>>>>>
>>>>>>>  In the meantime, I'll try to prepare a reproducible test case for
>>>>>>> this.
>>>>>>>
>>>>>>>  Thanks,
>>>>>>>
>>>>>>> Danijel
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <
>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>
>>>>>>>> To (partially) answer my own question -- I still have no idea on
>>>>>>>> the cause of the stuck topology, but re-submitting the topology helps --
>>>>>>>> after re-submitting my topology is now running normally.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <
>>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>>
>>>>>>>>>  Also, I did have multiple cases of my IBackingMap workers dying
>>>>>>>>> (because of RuntimeExceptions) but successfully restarting afterwards (I
>>>>>>>>> throw RuntimeExceptions in the BackingMap implementation as my strategy in
>>>>>>>>> rare SQL database deadlock situations to force a worker restart and to
>>>>>>>>> fail+retry the batch).
>>>>>>>>>
>>>>>>>>>  From the logs, one such IBackingMap worker death (and subsequent
>>>>>>>>> restart) resulted in the Kafka spout re-emitting the pending tuple:
>>>>>>>>>
>>>>>>>>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO]
>>>>>>>>> re-emitting batch, attempt 29698959:736
>>>>>>>>>
>>>>>>>>>  This is of course the normal behavior of a transactional
>>>>>>>>> topology, but this is the first time I've encountered a case of a batch
>>>>>>>>> retrying indefinitely. This is especially suspicious since the topology has
>>>>>>>>> been running fine for 20 days straight, re-emitting batches and restarting
>>>>>>>>> IBackingMap workers quite a number of times.
>>>>>>>>>
>>>>>>>>> I can see in my IBackingMap backing SQL database that the batch
>>>>>>>>> with the exact txid value 29698959 has been committed -- but I suspect that
>>>>>>>>> could come from another BackingMap, since there are two BackingMap
>>>>>>>>> instances running (paralellismHint 2).
>>>>>>>>>
>>>>>>>>>  However, I have no idea why the batch is being retried
>>>>>>>>> indefinitely now nor why it hasn't been successfully acked by Trident.
>>>>>>>>>
>>>>>>>>> Any suggestions on the area (topology component) to focus my
>>>>>>>>> research on?
>>>>>>>>>
>>>>>>>>>  Thanks,
>>>>>>>>>
>>>>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <
>>>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>>>
>>>>>>>>>>   Hello,
>>>>>>>>>>
>>>>>>>>>> I'm having problems with my transactional Trident topology. It
>>>>>>>>>> has been running fine for about 20 days, and suddenly is stuck processing a
>>>>>>>>>> single batch, with no tuples being emitted nor tuples being persisted by
>>>>>>>>>> the TridentState (IBackingMap).
>>>>>>>>>>
>>>>>>>>>> It's a simple topology which consumes messages off a Kafka queue.
>>>>>>>>>> The spout is an instance of storm-kafka-0.8-plus
>>>>>>>>>> TransactionalTridentKafkaSpout and I use the trident-mssql transactional
>>>>>>>>>> TridentState implementation to persistentAggregate() data into a SQL
>>>>>>>>>> database.
>>>>>>>>>>
>>>>>>>>>>  In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>>>>>>>
>>>>>>>>>>      "/transactional/<myTopologyName>/coordinator/currattempts"
>>>>>>>>>> is "{"29698959":6487}"
>>>>>>>>>>
>>>>>>>>>> ... and the attempt count keeps increasing. It seems the batch
>>>>>>>>>> with txid 29698959 is stuck, as the attempt count in Zookeeper keeps
>>>>>>>>>> increasing -- seems like the batch isn't being acked by Trident and I have
>>>>>>>>>> no idea why, especially since the topology has been running successfully
>>>>>>>>>> the last 20 days.
>>>>>>>>>>
>>>>>>>>>>  I did rebalance the topology on one occasion, after which it
>>>>>>>>>> continued running normally. Other than that, no other modifications were
>>>>>>>>>> done. Storm is at version 0.9.0.1.
>>>>>>>>>>
>>>>>>>>>>  Any hints on how to debug the stuck topology? Any other useful
>>>>>>>>>> info I might provide?
>>>>>>>>>>
>>>>>>>>>>  Thanks,
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>>
>>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>>> T: +385989035562
>>>>>>>>>> Skype: danijel.schiavuzzi
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>
>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>> T: +385989035562
>>>>>>>>> Skype: danijel.schiavuzzi
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Danijel Schiavuzzi
>>>>>>>>
>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>> W: www.schiavuzzi.com
>>>>>>>> T: +385989035562
>>>>>>>>  Skype: danijels7
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Danijel Schiavuzzi
>>>>>>>
>>>>>>> E: danijel@schiavuzzi.com
>>>>>>> W: www.schiavuzzi.com
>>>>>>> T: +385989035562
>>>>>>> Skype: danijels7
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Danijel Schiavuzzi
>>>>>>
>>>>>> E: danijel@schiavuzzi.com
>>>>>> W: www.schiavuzzi.com
>>>>>> T: +385989035562
>>>>>> Skype: danijels7
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Danijel Schiavuzzi
>>>>
>>>> E: danijel@schiavuzzi.com
>>>> W: www.schiavuzzi.com
>>>> T: +385989035562
>>>> Skype: danijels7
>>>>
>>>
>>>
>>
>> --
>> Danijel Schiavuzzi
>>
>> E: danijel@schiavuzzi.com
>> W: www.schiavuzzi.com
>> T: +385989035562
>> Skype: danijels7
>>
>
>


-- 
Danijel Schiavuzzi

E: danijel@schiavuzzi.com
W: www.schiavuzzi.com
T: +385989035562
Skype: danijels7

Re: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Posted by "M.Tarkeshwar Rao" <ta...@gmail.com>.
Thanks Danijel for helping me.


On Thu, Jul 17, 2014 at 1:37 AM, Danijel Schiavuzzi <da...@schiavuzzi.com>
wrote:

> I see no issues with your cluster configuration.
>
> You should definitely share the (simplified if possible) topology code and
> the steps to reproduce the blockage, better yet you should file a JIRA task
> on Apache's JIRA web -- be sure to include your Trident
> internals modifications.
>
> Unfortunately, seems I'm having the same issues now with Storm 0.9.2
> too, so I might get back here with some updates soon. It's not so fast
> and easily reproducible as it was under 0.9.1, but the bug
> seems nonetheless still present. I'll reduce the number of Storm slots and
> topology workers as per your insights, hopefully this might make it easier
> to reproduce the bug with a simplified Trident topology.
>
>
> On Tuesday, July 15, 2014, M.Tarkeshwar Rao <ta...@gmail.com>
> wrote:
>
>> Hi Denijel,
>>
>> We have done few changes in the the trident core framework code as per
>> our need which is working fine with zeromq. I am sharing configuration
>> which we are using. Can you please suggest our config is fine or not?
>>
>>  Code part is so large so we are writing some sample topology and trying
>> to reproduce the issue, which we will share with you.
>>
>> What are the steps to reproduce the issue:
>>  -------------------------------------------------------------
>>
>> 1. we deployed our topology with one linux machine, two workers and one
>> acker with batch size 2.
>> 2. both the worker are up and start the processing.
>> 3. after few seconds i killed one of the worker kill -9.
>> 4. when the killed worker spawned on the same port it is getting hanged.
>> 5. only retries going on.
>> 6. when the killed worker spawned on the another port everything working
>> fine.
>>
>> machine conf:
>> --------------------------
>> [root@sb6270x1637-2 conf]# uname -a
>>
>> Linux bl460cx2378 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10 14:46:43
>> EST 2014 x86_64 x86_64 x86_64 GNU/Linux
>>
>>
>> *storm.yaml* which we are using to launch  nimbus, supervisor and ui
>>
>> ########## These MUST be filled in for a storm configuration
>>  storm.zookeeper.servers:
>>      - "10.61.244.86"
>>  storm.zookeeper.port: 2000
>>  supervisor.slots.ports:
>>     - 6788
>>     - 6789
>>     - 6800
>>     - 6801
>>     - 6802
>>      - 6803
>>
>>  nimbus.host: "10.61.244.86"
>>
>>
>>  storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>>
>>  storm.messaging.netty.server_worker_threads: 10
>>  storm.messaging.netty.client_worker_threads: 10
>>  storm.messaging.netty.buffer_size: 5242880
>>  storm.messaging.netty.max_retries: 100
>>  storm.messaging.netty.max_wait_ms: 1000
>>  storm.messaging.netty.min_wait_ms: 100
>>  storm.local.dir: "/root/home_98/home/enavgoy/storm-local"
>>  storm.scheduler: "com.ericsson.storm.scheduler.TopologyScheduler"
>>  topology.acker.executors: 1
>>  topology.message.timeout.secs: 30
>>  supervisor.scheduler.meta:
>>       name: "supervisor1"
>>
>>
>>  worker.childopts: "-Xmx2048m"
>>
>>  mm.hdfs.ipaddress: "10.61.244.7"
>>  mm.hdfs.port: 9000
>>  topology.batch.size: 2
>>  topology.batch.timeout: 10000
>>  topology.workers: 2
>>  topology.debug: true
>>
>> Regards
>> Tarkeshwar
>>
>>
>>
>> On Mon, Jul 7, 2014 at 1:22 PM, Danijel Schiavuzzi <
>> danijel@schiavuzzi.com> wrote:
>>
>>> Hi Tarkeshwar,
>>>
>>> Could you provide a code sample of your topology? Do you have any
>>> special configs enabled?
>>>
>>> Thanks,
>>>
>>> Danijel
>>>
>>>
>>> On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <tarkeshwar4u@gmail.com
>>> > wrote:
>>>
>>>> Hi Danijel,
>>>>
>>>> We are able to reproduce this issue with 0.9.2 as well.
>>>> We have two worker setup to run the trident topology.
>>>>
>>>> When we kill one of the worker and again when that killed worker spawn
>>>> on same port(same slot) then that worker not able to communicate with 2nd
>>>> worker.
>>>>
>>>> only transaction attempts are increasing continuously.
>>>>
>>>> But if the killed worker spawn on new slot(new communication port) then
>>>> it working fine. Same behavior as in storm 9.0.1.
>>>>
>>>> Please update me if you get any new development.
>>>>
>>>> Regards
>>>> Tarkeshwar
>>>>
>>>>
>>>> On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <
>>>> danijel@schiavuzzi.com> wrote:
>>>>
>>>>> Hi Bobby,
>>>>>
>>>>> Just an update on the stuck Trident transactional topology issue --
>>>>> I've upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't
>>>>> reproduce the bug anymore. Will keep you posted if any issues arise.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Danijel
>>>>>
>>>>>
>>>>> On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com>
>>>>> wrote:
>>>>>
>>>>>>  I have not seen this before, if you could file a JIRA on this that
>>>>>> would be great.
>>>>>>
>>>>>>  - Bobby
>>>>>>
>>>>>>   From: Danijel Schiavuzzi <da...@schiavuzzi.com>
>>>>>> Reply-To: "user@storm.incubator.apache.org" <
>>>>>> user@storm.incubator.apache.org>
>>>>>> Date: Wednesday, June 4, 2014 at 10:30 AM
>>>>>> To: "user@storm.incubator.apache.org" <
>>>>>> user@storm.incubator.apache.org>, "dev@storm.incubator.apache.org" <
>>>>>> dev@storm.incubator.apache.org>
>>>>>> Subject: Trident transactional topology stuck re-emitting batches
>>>>>> with Netty, but running fine with ZMQ (was Re: Topology is stuck)
>>>>>>
>>>>>>   Hi all,
>>>>>>
>>>>>> I've managed to reproduce the stuck topology problem and it seems
>>>>>> it's due to the Netty transport. Running with ZMQ transport enabled now and
>>>>>> I haven't been able to reproduce this.
>>>>>>
>>>>>>  The problem is basically a Trident/Kafka transactional topology
>>>>>> getting stuck, i.e. re-emitting the same batches over and over again. This
>>>>>> happens after the Storm workers restart a few times due to Kafka spout
>>>>>> throwing RuntimeExceptions (because of the Kafka consumer in the spout
>>>>>> timing out with a SocketTimeoutException due to some temporary network
>>>>>> problems). Sometimes the topology is stuck after just one worker is
>>>>>> restarted, and sometimes a few worker restarts are needed to trigger the
>>>>>> problem.
>>>>>>
>>>>>> I simulated the Kafka spout socket timeouts by blocking network
>>>>>> access from Storm to my Kafka machines (with an iptables firewall rule).
>>>>>> Most of the time the spouts (workers) would restart normally (after
>>>>>> re-enabling access to Kafka) and the topology would continue to process
>>>>>> batches, but sometimes the topology would get stuck re-emitting batches
>>>>>> after the crashed workers restarted. Killing and re-submitting the topology
>>>>>> manually fixes this always, and processing continues normally.
>>>>>>
>>>>>>  I haven't been able to reproduce this scenario after reverting my
>>>>>> Storm cluster's transport to ZeroMQ. With Netty transport, I can almost
>>>>>> always reproduce the problem by causing a worker to restart a number of
>>>>>> times (only about 4-5 worker restarts are enough to trigger this).
>>>>>>
>>>>>>  Any hints on this? Anyone had the same problem? It does seem a
>>>>>> serious issue as it affect the reliability and fault tolerance of the Storm
>>>>>> cluster.
>>>>>>
>>>>>>  In the meantime, I'll try to prepare a reproducible test case for
>>>>>> this.
>>>>>>
>>>>>>  Thanks,
>>>>>>
>>>>>> Danijel
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <
>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>
>>>>>>> To (partially) answer my own question -- I still have no idea on the
>>>>>>> cause of the stuck topology, but re-submitting the topology helps -- after
>>>>>>> re-submitting my topology is now running normally.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <
>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>
>>>>>>>>  Also, I did have multiple cases of my IBackingMap workers dying
>>>>>>>> (because of RuntimeExceptions) but successfully restarting afterwards (I
>>>>>>>> throw RuntimeExceptions in the BackingMap implementation as my strategy in
>>>>>>>> rare SQL database deadlock situations to force a worker restart and to
>>>>>>>> fail+retry the batch).
>>>>>>>>
>>>>>>>>  From the logs, one such IBackingMap worker death (and subsequent
>>>>>>>> restart) resulted in the Kafka spout re-emitting the pending tuple:
>>>>>>>>
>>>>>>>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO]
>>>>>>>> re-emitting batch, attempt 29698959:736
>>>>>>>>
>>>>>>>>  This is of course the normal behavior of a transactional
>>>>>>>> topology, but this is the first time I've encountered a case of a batch
>>>>>>>> retrying indefinitely. This is especially suspicious since the topology has
>>>>>>>> been running fine for 20 days straight, re-emitting batches and restarting
>>>>>>>> IBackingMap workers quite a number of times.
>>>>>>>>
>>>>>>>> I can see in my IBackingMap backing SQL database that the batch
>>>>>>>> with the exact txid value 29698959 has been committed -- but I suspect that
>>>>>>>> could come from another BackingMap, since there are two BackingMap
>>>>>>>> instances running (paralellismHint 2).
>>>>>>>>
>>>>>>>>  However, I have no idea why the batch is being retried
>>>>>>>> indefinitely now nor why it hasn't been successfully acked by Trident.
>>>>>>>>
>>>>>>>> Any suggestions on the area (topology component) to focus my
>>>>>>>> research on?
>>>>>>>>
>>>>>>>>  Thanks,
>>>>>>>>
>>>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <
>>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>>
>>>>>>>>>   Hello,
>>>>>>>>>
>>>>>>>>> I'm having problems with my transactional Trident topology. It has
>>>>>>>>> been running fine for about 20 days, and suddenly is stuck processing a
>>>>>>>>> single batch, with no tuples being emitted nor tuples being persisted by
>>>>>>>>> the TridentState (IBackingMap).
>>>>>>>>>
>>>>>>>>> It's a simple topology which consumes messages off a Kafka queue.
>>>>>>>>> The spout is an instance of storm-kafka-0.8-plus
>>>>>>>>> TransactionalTridentKafkaSpout and I use the trident-mssql transactional
>>>>>>>>> TridentState implementation to persistentAggregate() data into a SQL
>>>>>>>>> database.
>>>>>>>>>
>>>>>>>>>  In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>>>>>>
>>>>>>>>>      "/transactional/<myTopologyName>/coordinator/currattempts" is
>>>>>>>>> "{"29698959":6487}"
>>>>>>>>>
>>>>>>>>> ... and the attempt count keeps increasing. It seems the batch
>>>>>>>>> with txid 29698959 is stuck, as the attempt count in Zookeeper keeps
>>>>>>>>> increasing -- seems like the batch isn't being acked by Trident and I have
>>>>>>>>> no idea why, especially since the topology has been running successfully
>>>>>>>>> the last 20 days.
>>>>>>>>>
>>>>>>>>>  I did rebalance the topology on one occasion, after which it
>>>>>>>>> continued running normally. Other than that, no other modifications were
>>>>>>>>> done. Storm is at version 0.9.0.1.
>>>>>>>>>
>>>>>>>>>  Any hints on how to debug the stuck topology? Any other useful
>>>>>>>>> info I might provide?
>>>>>>>>>
>>>>>>>>>  Thanks,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>
>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>> T: +385989035562
>>>>>>>>> Skype: danijel.schiavuzzi
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Danijel Schiavuzzi
>>>>>>>>
>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>> W: www.schiavuzzi.com
>>>>>>>> T: +385989035562
>>>>>>>> Skype: danijel.schiavuzzi
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Danijel Schiavuzzi
>>>>>>>
>>>>>>> E: danijel@schiavuzzi.com
>>>>>>> W: www.schiavuzzi.com
>>>>>>> T: +385989035562
>>>>>>>  Skype: danijels7
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Danijel Schiavuzzi
>>>>>>
>>>>>> E: danijel@schiavuzzi.com
>>>>>> W: www.schiavuzzi.com
>>>>>> T: +385989035562
>>>>>> Skype: danijels7
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Danijel Schiavuzzi
>>>>>
>>>>> E: danijel@schiavuzzi.com
>>>>> W: www.schiavuzzi.com
>>>>> T: +385989035562
>>>>> Skype: danijels7
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Danijel Schiavuzzi
>>>
>>> E: danijel@schiavuzzi.com
>>> W: www.schiavuzzi.com
>>> T: +385989035562
>>> Skype: danijels7
>>>
>>
>>
>
> --
> Danijel Schiavuzzi
>
> E: danijel@schiavuzzi.com
> W: www.schiavuzzi.com
> T: +385989035562
> Skype: danijels7
>

Re: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Posted by Danijel Schiavuzzi <da...@schiavuzzi.com>.
I see no issues with your cluster configuration.

You should definitely share the (simplified, if possible) topology code and
the steps to reproduce the blockage. Better yet, file a JIRA issue on
Apache's JIRA site -- and be sure to include your Trident internals
modifications.

Unfortunately, it seems I'm having the same issues with Storm 0.9.2 too, so
I might get back here with some updates soon. The bug is not as quickly and
easily reproducible as it was under 0.9.1, but it nonetheless still seems to
be present. I'll reduce the number of Storm slots and topology workers as
per your insights; hopefully that will make it easier to reproduce the bug
with a simplified Trident topology.
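
As a rough sketch (the values are assumptions on my side, apart from the
worker/acker/timeout settings already quoted in this thread), the reduced
submission config could look like the snippet below. Note that the number of
slots itself is a cluster-side setting (supervisor.slots.ports in storm.yaml),
not a topology Config knob:

    import backtype.storm.Config;

    public class ReducedReproConfig {
        public static Config build() {
            Config conf = new Config();
            conf.setNumWorkers(2);          // two workers: a restarted worker has to
                                            // re-establish exactly one Netty connection
            conf.setNumAckers(1);
            conf.setMessageTimeoutSecs(30);
            conf.setMaxSpoutPending(1);     // assumption: a single in-flight batch makes
                                            // an endlessly re-emitted batch easy to spot
            conf.setDebug(true);
            return conf;
        }
    }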

On Tuesday, July 15, 2014, M.Tarkeshwar Rao <ta...@gmail.com> wrote:

> Hi Denijel,
>
> We have done few changes in the the trident core framework code as per our
> need which is working fine with zeromq. I am sharing configuration which we
> are using. Can you please suggest our config is fine or not?
>
>  Code part is so large so we are writing some sample topology and trying
> to reproduce the issue, which we will share with you.
>
> What are the steps to reproduce the issue:
> -------------------------------------------------------------
>
> 1. we deployed our topology with one linux machine, two workers and one
> acker with batch size 2.
> 2. both the worker are up and start the processing.
> 3. after few seconds i killed one of the worker kill -9.
> 4. when the killed worker spawned on the same port it is getting hanged.
> 5. only retries going on.
> 6. when the killed worker spawned on the another port everything working
> fine.
>
> machine conf:
> --------------------------
> [root@sb6270x1637-2 conf]# uname -a
>
> Linux bl460cx2378 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10 14:46:43 EST
> 2014 x86_64 x86_64 x86_64 GNU/Linux
>
>
> *storm.yaml* which we are using to launch  nimbus, supervisor and ui
>
> ########## These MUST be filled in for a storm configuration
>  storm.zookeeper.servers:
>      - "10.61.244.86"
>  storm.zookeeper.port: 2000
>  supervisor.slots.ports:
>     - 6788
>     - 6789
>     - 6800
>     - 6801
>     - 6802
>     - 6803
>
>  nimbus.host: "10.61.244.86"
>
>
>  storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>
>  storm.messaging.netty.server_worker_threads: 10
>  storm.messaging.netty.client_worker_threads: 10
>  storm.messaging.netty.buffer_size: 5242880
>  storm.messaging.netty.max_retries: 100
>  storm.messaging.netty.max_wait_ms: 1000
>  storm.messaging.netty.min_wait_ms: 100
>  storm.local.dir: "/root/home_98/home/enavgoy/storm-local"
>  storm.scheduler: "com.ericsson.storm.scheduler.TopologyScheduler"
>  topology.acker.executors: 1
>  topology.message.timeout.secs: 30
>  supervisor.scheduler.meta:
>       name: "supervisor1"
>
>
>  worker.childopts: "-Xmx2048m"
>
>  mm.hdfs.ipaddress: "10.61.244.7"
>  mm.hdfs.port: 9000
>  topology.batch.size: 2
>  topology.batch.timeout: 10000
>  topology.workers: 2
>  topology.debug: true
>
> Regards
> Tarkeshwar
>
>
>
> On Mon, Jul 7, 2014 at 1:22 PM, Danijel Schiavuzzi <danijel@schiavuzzi.com>
> wrote:
>
>> Hi Tarkeshwar,
>>
>> Could you provide a code sample of your topology? Do you have any special
>> configs enabled?
>>
>> Thanks,
>>
>> Danijel
>>
>>
>> On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <tarkeshwar4u@gmail.com>
>> wrote:
>>
>>> Hi Danijel,
>>>
>>> We are able to reproduce this issue with 0.9.2 as well.
>>> We have two worker setup to run the trident topology.
>>>
>>> When we kill one of the worker and again when that killed worker spawn
>>> on same port(same slot) then that worker not able to communicate with 2nd
>>> worker.
>>>
>>> only transaction attempts are increasing continuously.
>>>
>>> But if the killed worker spawn on new slot(new communication port) then
>>> it working fine. Same behavior as in storm 9.0.1.
>>>
>>> Please update me if you get any new development.
>>>
>>> Regards
>>> Tarkeshwar
>>>
>>>
>>> On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <
>>> danijel@schiavuzzi.com> wrote:
>>>
>>>> Hi Bobby,
>>>>
>>>> Just an update on the stuck Trident transactional topology issue --
>>>> I've upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't
>>>> reproduce the bug anymore. Will keep you posted if any issues arise.
>>>>
>>>> Regards,
>>>>
>>>> Danijel
>>>>
>>>>
>>>> On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <evans@yahoo-inc.com> wrote:
>>>>
>>>>>  I have not seen this before, if you could file a JIRA on this that
>>>>> would be great.
>>>>>
>>>>>  - Bobby
>>>>>
>>>>>   From: Danijel Schiavuzzi <danijel@schiavuzzi.com>
>>>>> Reply-To: "user@storm.incubator.apache.org" <
>>>>> user@storm.incubator.apache.org>
>>>>> Date: Wednesday, June 4, 2014 at 10:30 AM
>>>>> To: "user@storm.incubator.apache.org" <
>>>>> user@storm.incubator.apache.org>, "dev@storm.incubator.apache.org" <
>>>>> dev@storm.incubator.apache.org>
>>>>> Subject: Trident transactional topology stuck re-emitting batches
>>>>> with Netty, but running fine with ZMQ (was Re: Topology is stuck)
>>>>>
>>>>>   Hi all,
>>>>>
>>>>> I've managed to reproduce the stuck topology problem and it seems it's
>>>>> due to the Netty transport. Running with ZMQ transport enabled now and I
>>>>> haven't been able to reproduce this.
>>>>>
>>>>>  The problem is basically a Trident/Kafka transactional topology
>>>>> getting stuck, i.e. re-emitting the same batches over and over again. This
>>>>> happens after the Storm workers restart a few times due to Kafka spout
>>>>> throwing RuntimeExceptions (because of the Kafka consumer in the spout
>>>>> timing out with a SocketTimeoutException due to some temporary network
>>>>> problems). Sometimes the topology is stuck after just one worker is
>>>>> restarted, and sometimes a few worker restarts are needed to trigger the
>>>>> problem.
>>>>>
>>>>> I simulated the Kafka spout socket timeouts by blocking network access
>>>>> from Storm to my Kafka machines (with an iptables firewall rule). Most of
>>>>> the time the spouts (workers) would restart normally (after re-enabling
>>>>> access to Kafka) and the topology would continue to process batches, but
>>>>> sometimes the topology would get stuck re-emitting batches after the
>>>>> crashed workers restarted. Killing and re-submitting the topology manually
>>>>> fixes this always, and processing continues normally.
>>>>>
>>>>>  I haven't been able to reproduce this scenario after reverting my
>>>>> Storm cluster's transport to ZeroMQ. With Netty transport, I can almost
>>>>> always reproduce the problem by causing a worker to restart a number of
>>>>> times (only about 4-5 worker restarts are enough to trigger this).
>>>>>
>>>>>  Any hints on this? Anyone had the same problem? It does seem a
>>>>> serious issue as it affect the reliability and fault tolerance of the Storm
>>>>> cluster.
>>>>>
>>>>>  In the meantime, I'll try to prepare a reproducible test case for
>>>>> this.
>>>>>
>>>>>  Thanks,
>>>>>
>>>>> Danijel
>>>>>
>>>>>
>>>>> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <
>>>>> danijel@schiavuzzi.com> wrote:
>>>>>
>>>>>> To (partially) answer my own question -- I still have no idea on the
>>>>>> cause of the stuck topology, but re-submitting the topology helps -- after
>>>>>> re-submitting my topology is now running normally.
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <
>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>
>>>>>>>  Also, I did have multiple cases of my IBackingMap workers dying
>>>>>>> (because of RuntimeExceptions) but successfully restarting afterwards (I
>>>>>>> throw RuntimeExceptions in the BackingMap implementation as my strategy in
>>>>>>> rare SQL database deadlock situations to force a worker restart and to
>>>>>>> fail+retry the batch).
>>>>>>>
>>>>>>>  From the logs, one such IBackingMap worker death (and subsequent
>>>>>>> restart) resulted in the Kafka spout re-emitting the pending tuple:
>>>>>>>
>>>>>>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting
>>>>>>> batch, attempt 29698959:736
>>>>>>>
>>>>>>>  This is of course the normal behavior of a transactional topology,
>>>>>>> but this is the first time I've encountered a case of a batch retrying
>>>>>>> indefinitely. This is especially suspicious since the topology has been
>>>>>>> running fine for 20 days straight, re-emitting batches and restarting
>>>>>>> IBackingMap workers quite a number of times.
>>>>>>>
>>>>>>> I can see in my IBackingMap backing SQL database that the batch with
>>>>>>> the exact txid value 29698959 has been committed -- but I suspect that
>>>>>>> could come from another BackingMap, since there are two BackingMap
>>>>>>> instances running (paralellismHint 2).
>>>>>>>
>>>>>>>  However, I have no idea why the batch is being retried
>>>>>>> indefinitely now nor why it hasn't been successfully acked by Trident.
>>>>>>>
>>>>>>> Any suggestions on the area (topology component) to focus my
>>>>>>> research on?
>>>>>>>
>>>>>>>  Thanks,
>>>>>>>
>>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <
>>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>>
>>>>>>>>   Hello,
>>>>>>>>
>>>>>>>> I'm having problems with my transactional Trident topology. It has
>>>>>>>> been running fine for about 20 days, and suddenly is stuck processing a
>>>>>>>> single batch, with no tuples being emitted nor tuples being persisted by
>>>>>>>> the TridentState (IBackingMap).
>>>>>>>>
>>>>>>>> It's a simple topology which consumes messages off a Kafka queue.
>>>>>>>> The spout is an instance of storm-kafka-0.8-plus
>>>>>>>> TransactionalTridentKafkaSpout and I use the trident-mssql transactional
>>>>>>>> TridentState implementation to persistentAggregate() data into a SQL
>>>>>>>> database.
>>>>>>>>
>>>>>>>>  In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>>>>>
>>>>>>>>      "/transactional/<myTopologyName>/coordinator/currattempts" is
>>>>>>>> "{"29698959":6487}"
>>>>>>>>
>>>>>>>> ... and the attempt count keeps increasing. It seems the batch with
>>>>>>>> txid 29698959 is stuck, as the attempt count in Zookeeper keeps increasing
>>>>>>>> -- seems like the batch isn't being acked by Trident and I have no idea
>>>>>>>> why, especially since the topology has been running successfully the last
>>>>>>>> 20 days.
>>>>>>>>
>>>>>>>>  I did rebalance the topology on one occasion, after which it
>>>>>>>> continued running normally. Other than that, no other modifications were
>>>>>>>> done. Storm is at version 0.9.0.1.
>>>>>>>>
>>>>>>>>  Any hints on how to debug the stuck topology? Any other useful
>>>>>>>> info I might provide?
>>>>>>>>
>>>>>>>>  Thanks,
>>>>>>>>
>>>>>>>> --
>>>>>>>> Danijel Schiavuzzi
>>>>>>>>
>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>> W: www.schiavuzzi.com
>>>>>>>> T: +385989035562
>>>>>>>> Skype: danijel.schiavuzzi
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Danijel Schiavuzzi
>>>>>>>
>>>>>>> E: danijel@schiavuzzi.com
>>>>>>> W: www.schiavuzzi.com
>>>>>>> T: +385989035562
>>>>>>> Skype: danijel.schiavuzzi
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Danijel Schiavuzzi
>>>>>>
>>>>>> E: danijel@schiavuzzi.com
>>>>>> W: www.schiavuzzi.com
>>>>>> T: +385989035562
>>>>>>  Skype: danijels7
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Danijel Schiavuzzi
>>>>>
>>>>> E: danijel@schiavuzzi.com
>>>>> W: www.schiavuzzi.com
>>>>> T: +385989035562
>>>>> Skype: danijels7
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Danijel Schiavuzzi
>>>>
>>>> E: danijel@schiavuzzi.com
>>>> W: www.schiavuzzi.com
>>>> T: +385989035562
>>>> Skype: danijels7
>>>>
>>>
>>>
>>
>>
>> --
>> Danijel Schiavuzzi
>>
>> E: danijel@schiavuzzi.com
>> W: www.schiavuzzi.com
>> T: +385989035562
>> Skype: danijels7
>>
>
>

-- 
Danijel Schiavuzzi

E: danijel@schiavuzzi.com
W: www.schiavuzzi.com
T: +385989035562
Skype: danijels7

Re: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Posted by "M.Tarkeshwar Rao" <ta...@gmail.com>.
Hi Danijel,

We have made a few changes to the Trident core framework code to suit our
needs, and those changes work fine with ZeroMQ. I am sharing the
configuration we are using. Can you please check whether our config is fine
or not?

The code base is too large to share as is, so we are writing a small sample
topology that reproduces the issue, which we will share with you (a minimal
sketch of such a topology follows the reproduction steps below).

Steps to reproduce the issue:
-------------------------------------------------------------

1. We deployed our topology on one Linux machine, with two workers and one
acker, and a batch size of 2.
2. Both workers come up and start processing.
3. After a few seconds I killed one of the workers with kill -9.
4. When the killed worker is respawned on the same port, it hangs.
5. Only batch retries keep going on.
6. When the killed worker is respawned on another port, everything works
fine.
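
For reference, below is a minimal, hypothetical sketch of the kind of
two-worker transactional Trident topology that exercises this path -- it is
not our actual (modified) topology, and the Kafka topic, Zookeeper connect
string, and class names are placeholders:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.tuple.Fields;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;
    import storm.kafka.trident.TransactionalTridentKafkaSpout;
    import storm.kafka.trident.TridentKafkaConfig;
    import storm.trident.TridentTopology;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.MemoryMapState;

    public class ReproTopology {
        public static void main(String[] args) throws Exception {
            // Transactional Kafka spout; broker Zookeeper and topic are placeholders.
            TridentKafkaConfig kafkaConf =
                    new TridentKafkaConfig(new ZkHosts("10.61.244.86:2000"), "repro-topic");
            kafkaConf.scheme = new SchemeAsMultiScheme(new StringScheme());

            // Simple transactional count, grouped on the "str" field emitted by
            // StringScheme; in-memory state just keeps the sketch self-contained.
            TridentTopology topology = new TridentTopology();
            topology.newStream("kafka-spout", new TransactionalTridentKafkaSpout(kafkaConf))
                    .groupBy(new Fields("str"))
                    .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                            new Fields("count"));

            Config conf = new Config();
            conf.setNumWorkers(2);           // two workers, as in step 1 above
            conf.setNumAckers(1);            // one acker, as in step 1 above
            conf.setMessageTimeoutSecs(30);  // matches topology.message.timeout.secs below

            StormSubmitter.submitTopology("repro-topology", conf, topology.build());
        }
    }

A sketch along these lines is what we plan to use for the sample topology
mentioned above, submitting it and then killing one worker JVM with kill -9
as in steps 3-6.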

machine conf:
--------------------------
[root@sb6270x1637-2 conf]# uname -a

Linux bl460cx2378 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10 14:46:43 EST
2014 x86_64 x86_64 x86_64 GNU/Linux


The *storm.yaml* we are using to launch nimbus, supervisor, and ui:

########## These MUST be filled in for a storm configuration
 storm.zookeeper.servers:
     - "10.61.244.86"
 storm.zookeeper.port: 2000
 supervisor.slots.ports:
    - 6788
    - 6789
    - 6800
    - 6801
    - 6802
    - 6803

 nimbus.host: "10.61.244.86"


 storm.messaging.transport: "backtype.storm.messaging.netty.Context"

 storm.messaging.netty.server_worker_threads: 10
 storm.messaging.netty.client_worker_threads: 10
 storm.messaging.netty.buffer_size: 5242880
 storm.messaging.netty.max_retries: 100
 storm.messaging.netty.max_wait_ms: 1000
 storm.messaging.netty.min_wait_ms: 100
 storm.local.dir: "/root/home_98/home/enavgoy/storm-local"
 storm.scheduler: "com.ericsson.storm.scheduler.TopologyScheduler"
 topology.acker.executors: 1
 topology.message.timeout.secs: 30
 supervisor.scheduler.meta:
      name: "supervisor1"


 worker.childopts: "-Xmx2048m"

 mm.hdfs.ipaddress: "10.61.244.7"
 mm.hdfs.port: 9000
 topology.batch.size: 2
 topology.batch.timeout: 10000
 topology.workers: 2
 topology.debug: true

Regards
Tarkeshwar



On Mon, Jul 7, 2014 at 1:22 PM, Danijel Schiavuzzi <da...@schiavuzzi.com>
wrote:

> Hi Tarkeshwar,
>
> Could you provide a code sample of your topology? Do you have any special
> configs enabled?
>
> Thanks,
>
> Danijel
>
>
> On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <ta...@gmail.com>
> wrote:
>
>> Hi Danijel,
>>
>> We are able to reproduce this issue with 0.9.2 as well.
>> We have two worker setup to run the trident topology.
>>
>> When we kill one of the worker and again when that killed worker spawn on
>> same port(same slot) then that worker not able to communicate with 2nd
>> worker.
>>
>> only transaction attempts are increasing continuously.
>>
>> But if the killed worker spawn on new slot(new communication port) then
>> it working fine. Same behavior as in storm 9.0.1.
>>
>> Please update me if you get any new development.
>>
>> Regards
>> Tarkeshwar
>>
>>
>> On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <
>> danijel@schiavuzzi.com> wrote:
>>
>>> Hi Bobby,
>>>
>>> Just an update on the stuck Trident transactional topology issue -- I've
>>> upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't
>>> reproduce the bug anymore. Will keep you posted if any issues arise.
>>>
>>> Regards,
>>>
>>> Danijel
>>>
>>>
>>> On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com>
>>> wrote:
>>>
>>>>  I have not seen this before, if you could file a JIRA on this that
>>>> would be great.
>>>>
>>>>  - Bobby
>>>>
>>>>   From: Danijel Schiavuzzi <da...@schiavuzzi.com>
>>>> Reply-To: "user@storm.incubator.apache.org" <
>>>> user@storm.incubator.apache.org>
>>>> Date: Wednesday, June 4, 2014 at 10:30 AM
>>>> To: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>,
>>>> "dev@storm.incubator.apache.org" <de...@storm.incubator.apache.org>
>>>> Subject: Trident transactional topology stuck re-emitting batches with
>>>> Netty, but running fine with ZMQ (was Re: Topology is stuck)
>>>>
>>>>   Hi all,
>>>>
>>>> I've managed to reproduce the stuck topology problem and it seems it's
>>>> due to the Netty transport. Running with ZMQ transport enabled now and I
>>>> haven't been able to reproduce this.
>>>>
>>>>  The problem is basically a Trident/Kafka transactional topology
>>>> getting stuck, i.e. re-emitting the same batches over and over again. This
>>>> happens after the Storm workers restart a few times due to Kafka spout
>>>> throwing RuntimeExceptions (because of the Kafka consumer in the spout
>>>> timing out with a SocketTimeoutException due to some temporary network
>>>> problems). Sometimes the topology is stuck after just one worker is
>>>> restarted, and sometimes a few worker restarts are needed to trigger the
>>>> problem.
>>>>
>>>> I simulated the Kafka spout socket timeouts by blocking network access
>>>> from Storm to my Kafka machines (with an iptables firewall rule). Most of
>>>> the time the spouts (workers) would restart normally (after re-enabling
>>>> access to Kafka) and the topology would continue to process batches, but
>>>> sometimes the topology would get stuck re-emitting batches after the
>>>> crashed workers restarted. Killing and re-submitting the topology manually
>>>> fixes this always, and processing continues normally.
>>>>
>>>>  I haven't been able to reproduce this scenario after reverting my
>>>> Storm cluster's transport to ZeroMQ. With Netty transport, I can almost
>>>> always reproduce the problem by causing a worker to restart a number of
>>>> times (only about 4-5 worker restarts are enough to trigger this).
>>>>
>>>>  Any hints on this? Anyone had the same problem? It does seem a
>>>> serious issue as it affect the reliability and fault tolerance of the Storm
>>>> cluster.
>>>>
>>>>  In the meantime, I'll try to prepare a reproducible test case for
>>>> this.
>>>>
>>>>  Thanks,
>>>>
>>>> Danijel
>>>>
>>>>
>>>> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <
>>>> danijel@schiavuzzi.com> wrote:
>>>>
>>>>> To (partially) answer my own question -- I still have no idea on the
>>>>> cause of the stuck topology, but re-submitting the topology helps -- after
>>>>> re-submitting my topology is now running normally.
>>>>>
>>>>>
>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <
>>>>> danijel@schiavuzzi.com> wrote:
>>>>>
>>>>>>  Also, I did have multiple cases of my IBackingMap workers dying
>>>>>> (because of RuntimeExceptions) but successfully restarting afterwards (I
>>>>>> throw RuntimeExceptions in the BackingMap implementation as my strategy in
>>>>>> rare SQL database deadlock situations to force a worker restart and to
>>>>>> fail+retry the batch).
>>>>>>
>>>>>>  From the logs, one such IBackingMap worker death (and subsequent
>>>>>> restart) resulted in the Kafka spout re-emitting the pending tuple:
>>>>>>
>>>>>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting
>>>>>> batch, attempt 29698959:736
>>>>>>
>>>>>>  This is of course the normal behavior of a transactional topology,
>>>>>> but this is the first time I've encountered a case of a batch retrying
>>>>>> indefinitely. This is especially suspicious since the topology has been
>>>>>> running fine for 20 days straight, re-emitting batches and restarting
>>>>>> IBackingMap workers quite a number of times.
>>>>>>
>>>>>> I can see in my IBackingMap backing SQL database that the batch with
>>>>>> the exact txid value 29698959 has been committed -- but I suspect that
>>>>>> could come from another BackingMap, since there are two BackingMap
>>>>>> instances running (paralellismHint 2).
>>>>>>
>>>>>>  However, I have no idea why the batch is being retried indefinitely
>>>>>> now nor why it hasn't been successfully acked by Trident.
>>>>>>
>>>>>> Any suggestions on the area (topology component) to focus my research
>>>>>> on?
>>>>>>
>>>>>>  Thanks,
>>>>>>
>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <
>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>
>>>>>>>   Hello,
>>>>>>>
>>>>>>> I'm having problems with my transactional Trident topology. It has
>>>>>>> been running fine for about 20 days, and suddenly is stuck processing a
>>>>>>> single batch, with no tuples being emitted nor tuples being persisted by
>>>>>>> the TridentState (IBackingMap).
>>>>>>>
>>>>>>> It's a simple topology which consumes messages off a Kafka queue.
>>>>>>> The spout is an instance of storm-kafka-0.8-plus
>>>>>>> TransactionalTridentKafkaSpout and I use the trident-mssql transactional
>>>>>>> TridentState implementation to persistentAggregate() data into a SQL
>>>>>>> database.
>>>>>>>
>>>>>>>  In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>>>>
>>>>>>>      "/transactional/<myTopologyName>/coordinator/currattempts" is
>>>>>>> "{"29698959":6487}"
>>>>>>>
>>>>>>> ... and the attempt count keeps increasing. It seems the batch with
>>>>>>> txid 29698959 is stuck, as the attempt count in Zookeeper keeps increasing
>>>>>>> -- seems like the batch isn't being acked by Trident and I have no idea
>>>>>>> why, especially since the topology has been running successfully the last
>>>>>>> 20 days.
>>>>>>>
>>>>>>>  I did rebalance the topology on one occasion, after which it
>>>>>>> continued running normally. Other than that, no other modifications were
>>>>>>> done. Storm is at version 0.9.0.1.
>>>>>>>
>>>>>>>  Any hints on how to debug the stuck topology? Any other useful
>>>>>>> info I might provide?
>>>>>>>
>>>>>>>  Thanks,
>>>>>>>
>>>>>>> --
>>>>>>> Danijel Schiavuzzi
>>>>>>>
>>>>>>> E: danijel@schiavuzzi.com
>>>>>>> W: www.schiavuzzi.com
>>>>>>> T: +385989035562
>>>>>>> Skype: danijel.schiavuzzi
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Danijel Schiavuzzi
>>>>>>
>>>>>> E: danijel@schiavuzzi.com
>>>>>> W: www.schiavuzzi.com
>>>>>> T: +385989035562
>>>>>> Skype: danijel.schiavuzzi
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Danijel Schiavuzzi
>>>>>
>>>>> E: danijel@schiavuzzi.com
>>>>> W: www.schiavuzzi.com
>>>>> T: +385989035562
>>>>>  Skype: danijels7
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Danijel Schiavuzzi
>>>>
>>>> E: danijel@schiavuzzi.com
>>>> W: www.schiavuzzi.com
>>>> T: +385989035562
>>>> Skype: danijels7
>>>>
>>>
>>>
>>>
>>> --
>>> Danijel Schiavuzzi
>>>
>>> E: danijel@schiavuzzi.com
>>> W: www.schiavuzzi.com
>>> T: +385989035562
>>> Skype: danijels7
>>>
>>
>>
>
>
> --
> Danijel Schiavuzzi
>
> E: danijel@schiavuzzi.com
> W: www.schiavuzzi.com
> T: +385989035562
> Skype: danijels7
>

Re: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Posted by Danijel Schiavuzzi <da...@schiavuzzi.com>.
Hi Tarkeshwar,

Could you provide a code sample of your topology? Do you have any special
configs enabled?

Thanks,

Danijel


On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <ta...@gmail.com>
wrote:

> Hi Danijel,
>
> We are able to reproduce this issue with 0.9.2 as well.
> We have two worker setup to run the trident topology.
>
> When we kill one of the worker and again when that killed worker spawn on
> same port(same slot) then that worker not able to communicate with 2nd
> worker.
>
> only transaction attempts are increasing continuously.
>
> But if the killed worker spawn on new slot(new communication port) then it
> working fine. Same behavior as in storm 9.0.1.
>
> Please update me if you get any new development.
>
> Regards
> Tarkeshwar
>
>
> On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <danijel@schiavuzzi.com
> > wrote:
>
>> Hi Bobby,
>>
>> Just an update on the stuck Trident transactional topology issue -- I've
>> upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't
>> reproduce the bug anymore. Will keep you posted if any issues arise.
>>
>> Regards,
>>
>> Danijel
>>
>>
>> On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:
>>
>>>  I have not seen this before, if you could file a JIRA on this that
>>> would be great.
>>>
>>>  - Bobby
>>>
>>>   From: Danijel Schiavuzzi <da...@schiavuzzi.com>
>>> Reply-To: "user@storm.incubator.apache.org" <
>>> user@storm.incubator.apache.org>
>>> Date: Wednesday, June 4, 2014 at 10:30 AM
>>> To: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>,
>>> "dev@storm.incubator.apache.org" <de...@storm.incubator.apache.org>
>>> Subject: Trident transactional topology stuck re-emitting batches with
>>> Netty, but running fine with ZMQ (was Re: Topology is stuck)
>>>
>>>   Hi all,
>>>
>>> I've managed to reproduce the stuck topology problem and it seems it's
>>> due to the Netty transport. Running with ZMQ transport enabled now and I
>>> haven't been able to reproduce this.
>>>
>>>  The problem is basically a Trident/Kafka transactional topology
>>> getting stuck, i.e. re-emitting the same batches over and over again. This
>>> happens after the Storm workers restart a few times due to Kafka spout
>>> throwing RuntimeExceptions (because of the Kafka consumer in the spout
>>> timing out with a SocketTimeoutException due to some temporary network
>>> problems). Sometimes the topology is stuck after just one worker is
>>> restarted, and sometimes a few worker restarts are needed to trigger the
>>> problem.
>>>
>>> I simulated the Kafka spout socket timeouts by blocking network access
>>> from Storm to my Kafka machines (with an iptables firewall rule). Most of
>>> the time the spouts (workers) would restart normally (after re-enabling
>>> access to Kafka) and the topology would continue to process batches, but
>>> sometimes the topology would get stuck re-emitting batches after the
>>> crashed workers restarted. Killing and re-submitting the topology manually
>>> fixes this always, and processing continues normally.
>>>
>>>  I haven't been able to reproduce this scenario after reverting my
>>> Storm cluster's transport to ZeroMQ. With Netty transport, I can almost
>>> always reproduce the problem by causing a worker to restart a number of
>>> times (only about 4-5 worker restarts are enough to trigger this).
>>>
>>>  Any hints on this? Anyone had the same problem? It does seem a serious
>>> issue as it affect the reliability and fault tolerance of the Storm cluster.
>>>
>>>  In the meantime, I'll try to prepare a reproducible test case for this.
>>>
>>>  Thanks,
>>>
>>> Danijel
>>>
>>>
>>> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <
>>> danijel@schiavuzzi.com> wrote:
>>>
>>>> To (partially) answer my own question -- I still have no idea on the
>>>> cause of the stuck topology, but re-submitting the topology helps -- after
>>>> re-submitting my topology is now running normally.
>>>>
>>>>
>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <
>>>> danijel@schiavuzzi.com> wrote:
>>>>
>>>>>  Also, I did have multiple cases of my IBackingMap workers dying
>>>>> (because of RuntimeExceptions) but successfully restarting afterwards (I
>>>>> throw RuntimeExceptions in the BackingMap implementation as my strategy in
>>>>> rare SQL database deadlock situations to force a worker restart and to
>>>>> fail+retry the batch).
>>>>>
>>>>>  From the logs, one such IBackingMap worker death (and subsequent
>>>>> restart) resulted in the Kafka spout re-emitting the pending tuple:
>>>>>
>>>>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting
>>>>> batch, attempt 29698959:736
>>>>>
>>>>>  This is of course the normal behavior of a transactional topology,
>>>>> but this is the first time I've encountered a case of a batch retrying
>>>>> indefinitely. This is especially suspicious since the topology has been
>>>>> running fine for 20 days straight, re-emitting batches and restarting
>>>>> IBackingMap workers quite a number of times.
>>>>>
>>>>> I can see in my IBackingMap backing SQL database that the batch with
>>>>> the exact txid value 29698959 has been committed -- but I suspect that
>>>>> could come from another BackingMap, since there are two BackingMap
>>>>> instances running (paralellismHint 2).
>>>>>
>>>>>  However, I have no idea why the batch is being retried indefinitely
>>>>> now nor why it hasn't been successfully acked by Trident.
>>>>>
>>>>> Any suggestions on the area (topology component) to focus my research
>>>>> on?
>>>>>
>>>>>  Thanks,
>>>>>
>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <
>>>>> danijel@schiavuzzi.com> wrote:
>>>>>
>>>>>>   Hello,
>>>>>>
>>>>>> I'm having problems with my transactional Trident topology. It has
>>>>>> been running fine for about 20 days, and suddenly is stuck processing a
>>>>>> single batch, with no tuples being emitted nor tuples being persisted by
>>>>>> the TridentState (IBackingMap).
>>>>>>
>>>>>> It's a simple topology which consumes messages off a Kafka queue. The
>>>>>> spout is an instance of storm-kafka-0.8-plus TransactionalTridentKafkaSpout
>>>>>> and I use the trident-mssql transactional TridentState implementation to
>>>>>> persistentAggregate() data into a SQL database.
>>>>>>
>>>>>>  In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>>>
>>>>>>      "/transactional/<myTopologyName>/coordinator/currattempts" is
>>>>>> "{"29698959":6487}"
>>>>>>
>>>>>> ... and the attempt count keeps increasing. It seems the batch with
>>>>>> txid 29698959 is stuck, as the attempt count in Zookeeper keeps increasing
>>>>>> -- seems like the batch isn't being acked by Trident and I have no idea
>>>>>> why, especially since the topology has been running successfully the last
>>>>>> 20 days.
>>>>>>
>>>>>>  I did rebalance the topology on one occasion, after which it
>>>>>> continued running normally. Other than that, no other modifications were
>>>>>> done. Storm is at version 0.9.0.1.
>>>>>>
>>>>>>  Any hints on how to debug the stuck topology? Any other useful info
>>>>>> I might provide?
>>>>>>
>>>>>>  Thanks,
>>>>>>
>>>>>> --
>>>>>> Danijel Schiavuzzi
>>>>>>
>>>>>> E: danijel@schiavuzzi.com
>>>>>> W: www.schiavuzzi.com
>>>>>> T: +385989035562
>>>>>> Skype: danijel.schiavuzzi
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Danijel Schiavuzzi
>>>>>
>>>>> E: danijel@schiavuzzi.com
>>>>> W: www.schiavuzzi.com
>>>>> T: +385989035562
>>>>> Skype: danijel.schiavuzzi
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Danijel Schiavuzzi
>>>>
>>>> E: danijel@schiavuzzi.com
>>>> W: www.schiavuzzi.com
>>>> T: +385989035562
>>>>  Skype: danijels7
>>>>
>>>
>>>
>>>
>>> --
>>> Danijel Schiavuzzi
>>>
>>> E: danijel@schiavuzzi.com
>>> W: www.schiavuzzi.com
>>> T: +385989035562
>>> Skype: danijels7
>>>
>>
>>
>>
>> --
>> Danijel Schiavuzzi
>>
>> E: danijel@schiavuzzi.com
>> W: www.schiavuzzi.com
>> T: +385989035562
>> Skype: danijels7
>>
>
>


-- 
Danijel Schiavuzzi

E: danijel@schiavuzzi.com
W: www.schiavuzzi.com
T: +385989035562
Skype: danijels7

Re: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Posted by "M.Tarkeshwar Rao" <ta...@gmail.com>.
Hi Danijel,

We are able to reproduce this issue with 0.9.2 as well.
We have a two-worker setup running the Trident topology.

When we kill one of the workers and the killed worker is respawned on the
same port (same slot), that worker is not able to communicate with the
second worker.

Only the transaction attempts keep increasing continuously.
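
(A quick way to confirm this is to watch the coordinator attempt counter in
Zookeeper. Below is a rough sketch, not code from our setup: the connect
string is taken from our storm.yaml, the topology name is a placeholder, and
the znode path is the one from Danijel's original report, quoted below.)

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class WatchCurrAttempts {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("10.61.244.86:2000", 30000, new Watcher() {
                public void process(WatchedEvent event) { }
            });
            // Topology name is a placeholder; prints e.g. {"29698959":6487} -- a
            // txid whose attempt count only ever grows indicates a stuck batch.
            String path = "/transactional/myTopology/coordinator/currattempts";
            while (true) {
                byte[] data = zk.getData(path, false, null);
                System.out.println(new String(data, "UTF-8"));
                Thread.sleep(5000);
            }
        }
    }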

But if the killed worker is respawned on a new slot (a new communication
port), then it works fine. Same behavior as in Storm 0.9.0.1.

Please update me if there is any new development.

Regards
Tarkeshwar


On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <da...@schiavuzzi.com>
wrote:

> Hi Bobby,
>
> Just an update on the stuck Trident transactional topology issue -- I've
> upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't
> reproduce the bug anymore. Will keep you posted if any issues arise.
>
> Regards,
>
> Danijel
>
>
> On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:
>
>>  I have not seen this before, if you could file a JIRA on this that
>> would be great.
>>
>>  - Bobby
>>
>>   From: Danijel Schiavuzzi <da...@schiavuzzi.com>
>> Reply-To: "user@storm.incubator.apache.org" <
>> user@storm.incubator.apache.org>
>> Date: Wednesday, June 4, 2014 at 10:30 AM
>> To: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>,
>> "dev@storm.incubator.apache.org" <de...@storm.incubator.apache.org>
>> Subject: Trident transactional topology stuck re-emitting batches with
>> Netty, but running fine with ZMQ (was Re: Topology is stuck)
>>
>>   Hi all,
>>
>> I've managed to reproduce the stuck topology problem and it seems it's
>> due to the Netty transport. Running with ZMQ transport enabled now and I
>> haven't been able to reproduce this.
>>
>>  The problem is basically a Trident/Kafka transactional topology getting
>> stuck, i.e. re-emitting the same batches over and over again. This happens
>> after the Storm workers restart a few times due to Kafka spout throwing
>> RuntimeExceptions (because of the Kafka consumer in the spout timing out
>> with a SocketTimeoutException due to some temporary network problems).
>> Sometimes the topology is stuck after just one worker is restarted, and
>> sometimes a few worker restarts are needed to trigger the problem.
>>
>> I simulated the Kafka spout socket timeouts by blocking network access
>> from Storm to my Kafka machines (with an iptables firewall rule). Most of
>> the time the spouts (workers) would restart normally (after re-enabling
>> access to Kafka) and the topology would continue to process batches, but
>> sometimes the topology would get stuck re-emitting batches after the
>> crashed workers restarted. Killing and re-submitting the topology manually
>> fixes this always, and processing continues normally.
>>
>>  I haven't been able to reproduce this scenario after reverting my Storm
>> cluster's transport to ZeroMQ. With Netty transport, I can almost always
>> reproduce the problem by causing a worker to restart a number of times
>> (only about 4-5 worker restarts are enough to trigger this).
>>
>>  Any hints on this? Anyone had the same problem? It does seem a serious
>> issue as it affect the reliability and fault tolerance of the Storm cluster.
>>
>>  In the meantime, I'll try to prepare a reproducible test case for this.
>>
>>  Thanks,
>>
>> Danijel
>>
>>
>> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <
>> danijel@schiavuzzi.com> wrote:
>>
>>> To (partially) answer my own question -- I still have no idea on the
>>> cause of the stuck topology, but re-submitting the topology helps -- after
>>> re-submitting my topology is now running normally.
>>>
>>>
>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <
>>> danijel@schiavuzzi.com> wrote:
>>>
>>>>  Also, I did have multiple cases of my IBackingMap workers dying
>>>> (because of RuntimeExceptions) but successfully restarting afterwards (I
>>>> throw RuntimeExceptions in the BackingMap implementation as my strategy in
>>>> rare SQL database deadlock situations to force a worker restart and to
>>>> fail+retry the batch).
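>>>>
>>>> (For illustration, a minimal sketch of that strategy -- class and helper
>>>> names are hypothetical, this is not the actual trident-mssql code:)
>>>>
>>>>     import java.sql.SQLException;
>>>>     import java.util.ArrayList;
>>>>     import java.util.Collections;
>>>>     import java.util.List;
>>>>     import storm.trident.state.TransactionalValue;
>>>>     import storm.trident.state.map.IBackingMap;
>>>>
>>>>     public class SqlBackingMap implements IBackingMap<TransactionalValue<Long>> {
>>>>         @Override
>>>>         public List<TransactionalValue<Long>> multiGet(List<List<Object>> keys) {
>>>>             // Real code would SELECT the stored (txid, value) pairs; nulls just
>>>>             // mean "nothing stored yet for this key" to Trident.
>>>>             return new ArrayList<TransactionalValue<Long>>(
>>>>                     Collections.<TransactionalValue<Long>>nCopies(keys.size(), null));
>>>>         }
>>>>
>>>>         @Override
>>>>         public void multiPut(List<List<Object>> keys,
>>>>                              List<TransactionalValue<Long>> vals) {
>>>>             try {
>>>>                 writeRowsToSql(keys, vals);   // hypothetical JDBC batch upsert
>>>>             } catch (SQLException e) {
>>>>                 // Rethrow unchecked: the worker dies, Storm restarts it, and
>>>>                 // Trident re-emits the failed batch with the same txid.
>>>>                 throw new RuntimeException("SQL deadlock, failing batch for retry", e);
>>>>             }
>>>>         }
>>>>
>>>>         private void writeRowsToSql(List<List<Object>> keys,
>>>>                                     List<TransactionalValue<Long>> vals) throws SQLException {
>>>>             // Placeholder for the real JDBC upsert of (key, txid, value) rows.
>>>>         }
>>>>     }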
>>>>
>>>>  From the logs, one such IBackingMap worker death (and subsequent
>>>> restart) resulted in the Kafka spout re-emitting the pending batch:
>>>>
>>>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting
>>>> batch, attempt 29698959:736
>>>>
>>>>  This is of course the normal behavior of a transactional topology,
>>>> but this is the first time I've encountered a case of a batch retrying
>>>> indefinitely. This is especially suspicious since the topology has been
>>>> running fine for 20 days straight, re-emitting batches and restarting
>>>> IBackingMap workers quite a number of times.
>>>>
>>>> I can see in my IBackingMap backing SQL database that the batch with
>>>> the exact txid value 29698959 has been committed -- but I suspect that
>>>> could come from another BackingMap, since there are two BackingMap
>>>> instances running (parallelismHint 2).
>>>>
>>>>  However, I have no idea why the batch is being retried indefinitely
>>>> now nor why it hasn't been successfully acked by Trident.
>>>>
>>>> Any suggestions on the area (topology component) to focus my research
>>>> on?
>>>>
>>>>  Thanks,
>>>>
>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <
>>>> danijel@schiavuzzi.com> wrote:
>>>>
>>>>>   Hello,
>>>>>
>>>>> I'm having problems with my transactional Trident topology. It has
>>>>> been running fine for about 20 days, and is suddenly stuck processing a
>>>>> single batch, with no tuples being emitted nor tuples being persisted by
>>>>> the TridentState (IBackingMap).
>>>>>
>>>>> It's a simple topology which consumes messages off a Kafka queue. The
>>>>> spout is an instance of storm-kafka-0.8-plus TransactionalTridentKafkaSpout
>>>>> and I use the trident-mssql transactional TridentState implementation to
>>>>> persistentAggregate() data into a SQL database.
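>>>>>
>>>>> (Roughly, the wiring looks like this -- a sketch only: the decode function
>>>>> and the SQL StateFactory are placeholders, not my actual code:)
>>>>>
>>>>>     // assumes storm.trident.TridentTopology, backtype.storm.tuple.Fields,
>>>>>     // storm.trident.operation.builtin.Count and storm.kafka.trident.* are imported
>>>>>     TridentTopology topology = new TridentTopology();
>>>>>     TransactionalTridentKafkaSpout spout =
>>>>>             new TransactionalTridentKafkaSpout(kafkaConfig);  // TridentKafkaConfig for the topic
>>>>>     topology.newStream("kafka-stream", spout)
>>>>>             .each(new Fields("bytes"), new DecodeMessage(), new Fields("key"))  // hypothetical decoder
>>>>>             .groupBy(new Fields("key"))
>>>>>             .persistentAggregate(
>>>>>                     sqlStateFactory,      // hypothetical StateFactory wrapping the IBackingMap
>>>>>                     new Count(),
>>>>>                     new Fields("count"))
>>>>>             .parallelismHint(2);          // run two instances of the SQL-backed state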
>>>>>
>>>>>  In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>>
>>>>>      "/transactional/<myTopologyName>/coordinator/currattempts" is
>>>>> "{"29698959":6487}"
>>>>>
>>>>> ... and the attempt count keeps increasing. It seems the batch with
>>>>> txid 29698959 is stuck, as the attempt count in Zookeeper keeps increasing
>>>>> -- seems like the batch isn't being acked by Trident and I have no idea
>>>>> why, especially since the topology has been running successfully the last
>>>>> 20 days.
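>>>>>
>>>>> (For reference, I'm reading that znode with the stock ZooKeeper CLI, roughly
>>>>> like this -- the host is a placeholder, and /transactional is just the
>>>>> default transactional.zookeeper.root:)
>>>>>
>>>>>     # connect with the ZooKeeper CLI, then inside the shell:
>>>>>     bin/zkCli.sh -server <zookeeper-host>:2181
>>>>>     get /transactional/<myTopologyName>/coordinator/currattempts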
>>>>>
>>>>>  I did rebalance the topology on one occasion, after which it
>>>>> continued running normally. Other than that, no other modifications were
>>>>> done. Storm is at version 0.9.0.1.
>>>>>
>>>>>  Any hints on how to debug the stuck topology? Any other useful info
>>>>> I might provide?
>>>>>
>>>>>  Thanks,
>>>>>
>>>>> --
>>>>> Danijel Schiavuzzi
>>>>>
>>>>> E: danijel@schiavuzzi.com
>>>>> W: www.schiavuzzi.com
>>>>> T: +385989035562
>>>>> Skype: danijel.schiavuzzi
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Danijel Schiavuzzi
>>>>
>>>> E: danijel@schiavuzzi.com
>>>> W: www.schiavuzzi.com
>>>> T: +385989035562
>>>> Skype: danijel.schiavuzzi
>>>>
>>>
>>>
>>>
>>> --
>>> Danijel Schiavuzzi
>>>
>>> E: danijel@schiavuzzi.com
>>> W: www.schiavuzzi.com
>>> T: +385989035562
>>>  Skype: danijels7
>>>
>>
>>
>>
>> --
>> Danijel Schiavuzzi
>>
>> E: danijel@schiavuzzi.com
>> W: www.schiavuzzi.com
>> T: +385989035562
>> Skype: danijels7
>>
>
>
>
> --
> Danijel Schiavuzzi
>
> E: danijel@schiavuzzi.com
> W: www.schiavuzzi.com
> T: +385989035562
> Skype: danijels7
>