Posted to user@storm.apache.org by Vineet Mishra <cl...@gmail.com> on 2015/02/02 14:25:43 UTC

Storm Kafka Processing

Hi,

I am running a Kafka-Storm engine to process real-time data generated on a
3-node distributed cluster.

Currently I have set 10 executors for the Storm spout, but I don't think they
are running in parallel.
Moreover, earlier I was running the Kafka topology with a replication factor
and partition count of 1 (which seemed to run comparatively faster). Now I
have set the replication factor to 3 and the partitions to 10, and I can see
performance degradation.

Is there any way I can make full use of the available resources and get the
maximum throughput for event processing?

I am looking for expert suggestions urgently.

Thanks!

Re: Storm Kafka Processing

Posted by Vineet Mishra <cl...@gmail.com>.
It's going fine now.

I used the proposed Kafka-Storm library and it is working great.

I would like to quickly add a follow-up question.

Is there another way to maximize the parallelism of the Storm spout
reading from Kafka?
Or is increasing the number of partitions the only option for increasing throughput?

Thanks!

Re: Storm Kafka Processing

Posted by Harsha <st...@harsha.io>.
Vineet,
In the Kafka producer.send(KeyedMessage<Id, Message>) call, are you
passing in an ID? If this is constant or null, your data won't be
distributed to all partitions. In the case of a constant ID, all of your
messages go to the same partition, and in the case of null it chooses
round-robin to distribute among partitions. It's better to use a random
UUID to distribute among all of your partitions.
-Harsha
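
The effect of the key choice described above can be illustrated with a small stand-alone simulation. This is only a sketch: it does not call the Kafka client at all, and it assumes a hash-based partitioner of the form hash(key) mod numPartitions, which is the usual default for keyed messages. The class and method names (KeySpreadSketch, partitionFor, partitionsHit) are made up for this illustration. As a side note, the Kafka FAQ for the 0.8-era producer describes a null key as going to one randomly chosen partition that only changes at the metadata refresh interval, so the null-key case may not behave like a true round-robin.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

public class KeySpreadSketch {

    // Assumed partitioner: non-negative hash of the key, modulo the partition count.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    // Number of distinct partitions that receive at least one message.
    static int partitionsHit(List<String> keys, int numPartitions) {
        Set<Integer> hit = new HashSet<>();
        for (String key : keys) {
            hit.add(partitionFor(key, numPartitions));
        }
        return hit.size();
    }

    public static void main(String[] args) {
        int numPartitions = 10;   // matches the topic in this thread
        int messages = 10_000;    // arbitrary sample size

        // Every message carries the same key, so every message hashes alike.
        List<String> constantKeys = Collections.nCopies(messages, "same-id");

        // Every message carries a fresh random UUID as its key.
        List<String> uuidKeys = new ArrayList<>();
        for (int i = 0; i < messages; i++) {
            uuidKeys.add(UUID.randomUUID().toString());
        }

        System.out.println("constant key: "
                + partitionsHit(constantKeys, numPartitions) + " partition(s) used");
        System.out.println("random UUID keys: "
                + partitionsHit(uuidKeys, numPartitions) + " partition(s) used");
    }
}
```

With a constant key only one partition is ever used, while random UUID keys should spread across all 10 partitions for any realistic message count. With the real producer, the same idea would mean passing something like UUID.randomUUID().toString() as the key of each KeyedMessage.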




Re: Storm Kafka Processing

Posted by Vineet Mishra <cl...@gmail.com>.
Do you mean that the events published to Kafka are not being distributed
across partitions?

When creating the topic, I made sure to set the number of partitions to 10
and the replication factor to 3.

Is this affected by how I am writing to Kafka?

Thanks!


Re: Storm Kafka Processing

Posted by Andrew Neilson <ar...@gmail.com>.
The behaviour you are describing sounds like your topology is processing a
small backlog of events built up in each partition and then catching up to
realtime where events are only being published to one of the 10 partitions
at a time. I will echo Harsha in suggesting that you verify you are
actually publishing to all partitions (important: this is *not* the default
behaviour).


Re: Storm Kafka Processing

Posted by Vineet Mishra <cl...@gmail.com>.
Hi Harsha,

Following your suggestion, I made the change by switching the Kafka-Storm
bundle version.

Although I could see a difference between the previous bundle and the
current one, I was not satisfied with the way the spouts were processing.
My observations were:

The spout was running with an executor count of 10. When the job started,
around half of the executors (5) began processing in parallel to ingest
the data.

As soon as the counts reached around a million or so, the parallelism
dropped, and eventually it started processing serially (one executor at a
time).

Executors (All time)
Id Uptime Host Port Emitted Transferred Complete latency (ms) Acked Failed
[2-2] 13m 54s host3 6703 0 0 0.000 0 0
[3-3] 13m 52s host2 6702 318300 318300 4.789 318160 0
[4-4] 13m 52s host3 6702 434200 434200 7.064 434380 0
[5-5] 13m 53s host2 6701 20 20 0.000 0 0
[6-6] 13m 55s host3 6701 0 0 0.000 0 0
[7-7] 13m 51s host2 6700 25000 25000 4.122 24500 0
[8-8] 13m 51s host3 6700 248360 248360 9.514 245780 0
[9-9] 13m 52s host2 6703 0 0 0.000 0 0
[10-10] 13m 54s host3 6703 235220 235220 9.250 233200 0
[11-11] 13m 52s host2 6702 204420 204420 10.382 205800 0

I have around 0.2 billion events ingested into Kafka that need to be
processed through Storm in real time, but I am not sure what is causing this
unexpected intermittent behavior in Storm or how I can prevent it in the
near future.

Expert suggestions would be appreciated.

Thanks!
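
The skew in the executor table above can be quantified with a short calculation. The sketch below only re-uses the Emitted column from that table; the class and method names (ExecutorSkewSketch, total, idleCount) are invented for this illustration and are not part of Storm.

```java
public class ExecutorSkewSketch {

    // Emitted counts copied from the executor table, rows [2-2] through [11-11].
    static final long[] EMITTED = {
        0, 318300, 434200, 20, 0, 25000, 248360, 0, 235220, 204420
    };

    // Sum of all emitted tuples across executors.
    static long total(long[] counts) {
        long sum = 0;
        for (long c : counts) {
            sum += c;
        }
        return sum;
    }

    // Number of executors that emitted nothing at all.
    static int idleCount(long[] counts) {
        int idle = 0;
        for (long c : counts) {
            if (c == 0) {
                idle++;
            }
        }
        return idle;
    }

    public static void main(String[] args) {
        long grandTotal = total(EMITTED);
        System.out.println("total emitted: " + grandTotal);
        System.out.println("idle executors: " + idleCount(EMITTED) + " of " + EMITTED.length);
        for (int i = 0; i < EMITTED.length; i++) {
            double share = 100.0 * EMITTED[i] / grandTotal;
            // Executor ids in the table start at [2-2].
            System.out.printf("[%d-%d] %5.1f%%%n", i + 2, i + 2, share);
        }
    }
}
```

For these numbers the total is 1,465,520 emitted tuples, three of the ten executors emitted nothing, and two more ([5-5] and [7-7]) contributed only a small fraction, which is consistent with only about half of the partitions receiving data.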




Re: Storm Kafka Processing

Posted by Vineet Mishra <cl...@gmail.com>.
I am already running Kafka with 10 partitions and a replication factor
of 3, which matches the size of my cluster.

bin/kafka-topics.sh --create --zookeeper host1:2181,host2:2181,host3:2181 \
    --replication-factor 3 --partitions 10 --topic test

and I am also running the Kafka Storm topology with an executor count of 10:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("KafkaSpout", new KafkaSpout(kafkaConfig), 10);

My impression is that ever since I changed the replication factor and the
number of partitions from my previous setup*, I have been seeing added latency.

* bin/kafka-topics.sh --create --zookeeper host1:2181,host2:2181,host3:2181 \
    --replication-factor 1 --partitions 1 --topic test

I will try the Storm Kafka bundle suggested above. I hope that helps!

Thanks!


Re: Storm Kafka Processing

Posted by Harsha <st...@harsha.io>.
Vineet,
Can you try using the one in Storm,
https://github.com/apache/storm/tree/master/external/storm-kafka . This
is published to the Maven repo, so you can use the following:
<dependency>
  <groupId>org.apache.storm</groupId>
  <artifactId>storm-kafka</artifactId>
  <version>0.9.3</version>
</dependency>

If you are using a topic with 10 partitions, make sure you configure
your Kafka spout with parallelism set to 10. Also make sure that on the
producer side you are pushing data onto all 10 partitions, so that
your Kafka spout is fetching data from all 10 partitions.
-Harsha
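
Why the spout parallelism should match the partition count can be shown with a small stand-alone model. This is a sketch under an assumption: that the spout hands partitions to its tasks round-robin (task t reads partitions t, t + numTasks, and so on), which is the general scheme partition-aware Kafka spouts follow. It does not use the storm-kafka classes, and the names SpoutAssignmentSketch, assign and busyTasks are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SpoutAssignmentSketch {

    // Assumed round-robin scheme: task t reads partitions t, t + numTasks, t + 2 * numTasks, ...
    static List<List<Integer>> assign(int numPartitions, int numTasks) {
        List<List<Integer>> byTask = new ArrayList<>();
        for (int task = 0; task < numTasks; task++) {
            List<Integer> mine = new ArrayList<>();
            for (int p = task; p < numPartitions; p += numTasks) {
                mine.add(p);
            }
            byTask.add(mine);
        }
        return byTask;
    }

    // Number of tasks that own at least one partition and can therefore do work.
    static int busyTasks(List<List<Integer>> byTask) {
        int busy = 0;
        for (List<Integer> partitions : byTask) {
            if (!partitions.isEmpty()) {
                busy++;
            }
        }
        return busy;
    }

    public static void main(String[] args) {
        // 10 partitions, 10 spout tasks: one partition per task.
        System.out.println("10 partitions / 10 tasks: " + assign(10, 10));
        // 1 partition, 10 spout tasks: nine tasks have nothing to read.
        System.out.println("1 partition / 10 tasks, busy tasks: " + busyTasks(assign(1, 10)));
        // 10 partitions, 5 spout tasks: each task reads two partitions.
        System.out.println("10 partitions / 5 tasks: " + assign(10, 5));
    }
}
```

Under this model, adding spout executors beyond the number of partitions cannot raise read parallelism, because the extra tasks own no partition. The partition count sets the ceiling, and the producer still has to write to every partition for all tasks to stay busy.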




Re: Storm Kafka Processing

Posted by Vineet Mishra <cl...@gmail.com>.
Hi Harsha,

I am using the storm.kafka.KafkaSpout.KafkaSpout implementation from

https://github.com/wurstmeister/storm-kafka-0.8-plus

Thanks!


Re: Storm Kafka Processing

Posted by Harsha <st...@harsha.io>.
Vineet,
Which Kafka spout are you using?
-Harsha

