Posted to user@storm.apache.org by Rahul Kavale <ka...@gmail.com> on 2015/09/23 21:28:10 UTC

Kafka Spout emitting duplicate messages

Hi all,

I have been using Storm (0.10.0-beta) with Kafka (0.8.2) to build a
real-time data ingestion system.

The Kafka topic on which input messages arrive has a single partition and
a replication factor of 1.

The problem I am facing is that I am seeing duplicate messages read from
the spouts.

The number of duplicates is the same as the number of workers/machines I
have in the Storm cluster.
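
A count like this can be checked by tallying how often each message is seen downstream. The following is only a minimal sketch in plain Java, assuming each message can be identified by its Kafka partition and offset; the class DuplicateCounter and its methods are hypothetical names for illustration and are not part of Storm or storm-kafka.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Storm or storm-kafka) that tallies how often each message is seen. */
class DuplicateCounter {
    // Key: "partition:offset"; value: number of times that message was observed.
    private final Map<String, Integer> seen = new HashMap<>();

    /** Records one observation and returns how many times this message has now been seen. */
    int record(int partition, long offset) {
        String key = partition + ":" + offset;
        return seen.merge(key, 1, Integer::sum);
    }

    /** Returns the number of distinct messages that were seen more than once. */
    long duplicatedMessages() {
        return seen.values().stream().filter(n -> n > 1).count();
    }

    public static void main(String[] args) {
        DuplicateCounter counter = new DuplicateCounter();
        // Simulate three spout instances each reading offsets 0..4 of partition 0.
        for (int instance = 0; instance < 3; instance++) {
            for (long offset = 0; offset < 5; offset++) {
                counter.record(0, offset);
            }
        }
        System.out.println("Messages seen more than once: " + counter.duplicatedMessages());
    }
}
```

If every offset ends up with the same count, and that count matches the number of spout instances, the duplicates most likely come from several readers of the same partition rather than from replays of failed tuples.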

While debugging, I found duplicate KafkaSpout instances reading from the
same Kafka topic. Below is the config for the spout:

builder.setSpout(spoutId, new KafkaSpout(kafkaConfig), 1)
    .setNumTasks(1)
    .setMaxTaskParallelism(1)

Even with the above config, I can see multiple instances of the spout
running and consuming from the same Kafka topic, resulting in duplicate
messages.

I tried setting the number of workers for the topology to 1:
config.setNumWorkers(1)

Even with the above configuration, there are still 3 instances of the
spout running.

The topology works fine in local mode, i.e. no duplicate messages are read
from the Kafka topic. Duplicate messages are read only when the topology
is run in distributed mode, even if the number of workers is set to 1 for
the topology.

This question
<https://groups.google.com/forum/#!searchin/storm-user/kafkaspout/storm-user/vzAlIhAOntw/yo_-rUs8cj0J>
mentions a similar problem being solved by using the KafkaSpout from the
storm-kafka package; unfortunately, it's not working for me.

A similar question on Quora
<https://www.quora.com/If-I-increase-the-parallelism-of-a-Kafka-spout-in-my-storm-topology-how-can-I-stop-it-from-reading-the-same-message-in-a-topic-multiple-times>
says that just setting the parallelism hint should work.

A question on Stack Overflow
<http://stackoverflow.com/questions/18267834/storm-kafka-multiple-spouts-how-to-share-the-load>
mentions multiple Kafka spouts being achieved with the parallelism hint
as well.

Unfortunately, none of the above suggestions work for me as expected.

How can I make sure my KafkaSpout reads no duplicate messages, or how can
I create just a single instance of the spout?

Thanks & Regards,
Rahul Kavale