Posted to user@storm.apache.org by Hari Hara Sudhan Ramachandran <lo...@gmail.com> on 2016/10/29 22:16:36 UTC

Storm processing old messages from Microsoft EventHub after every topology restart

Hi All,



We are facing a peculiar issue in our application, which uses Apache Storm
for processing and Microsoft EventHub as the sender of messages.

Background information:

We have an Event Hub with 4 partitions and use the Microsoft-provided
EventHubSpout in our Storm topology.

We are currently doing a historical load, so the Event Hub receives 50K
messages an hour and Storm needs to process them.

We scaled our Storm cluster to 18 nodes (4 cores each) to support our
requirement.

We have 68 workers for the topology that shows this peculiar issue.

We keep the topology name the same after every restart.



I have the questions below; please clarify them.

1. Is it true that the number of workers must match the number of
partitions in the Event Hub?

For example, in our case the number of workers would be 4, since our Event
Hub has 4 partitions.

If so, what problems arise when the number of workers is greater than the
number of Event Hub partitions?

Could that cause the duplication issue we are facing now?

We believe we are acking the tuples in our bolts correctly, because Storm
does not pick up an old message again once it has processed it.

It reprocesses old messages only after the topology is restarted.


2. Is it true that we get better processing power when the number of
workers equals the number of supervisor nodes?

For example, we have 18 nodes, which would mean 18 workers.

Currently, we run 1 worker per slot. For example, we have 18 nodes (4 slots
each), so we configured 72 workers.
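
To make the worker counts in the two questions concrete, here is a minimal
sketch of the arithmetic. The class and method names are hypothetical and
only illustrate the numbers mentioned above; this is not code from our
topology and it does not use any Storm API.

```java
// Hypothetical helper: compares the three worker-count options discussed above.
class SlotMath {
    // Total worker slots in the cluster: supervisor nodes times slots per node.
    public static int totalSlots(int supervisorNodes, int slotsPerNode) {
        return supervisorNodes * slotsPerNode;
    }

    public static void main(String[] args) {
        int supervisorNodes = 18;   // cluster size from this post
        int slotsPerNode = 4;       // slots per supervisor node from this post
        int eventHubPartitions = 4; // Event Hub partition count from this post

        System.out.println("One worker per slot:      " + totalSlots(supervisorNodes, slotsPerNode));
        System.out.println("One worker per node:      " + supervisorNodes);
        System.out.println("One worker per partition: " + eventHubPartitions);
    }
}
```

The three options differ by a wide margin, which is why we would like to
know which rule, if any, actually applies.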

-- 
Thank You
Hari Hara Sudhan R