Posted to users@kafka.apache.org by Vineet Mishra <cl...@gmail.com> on 2015/02/03 12:03:28 UTC

Logstash to Kafka

Hi,

I have a setup where I am sniffing some logs (large ones, of course)
through Logstash Forwarder and forwarding them to Logstash, which in turn
publishes these events to Kafka.

I have created the Kafka topic with the required number of partitions and
replication factor, but I am not sure about the Logstash output
configuration and have the following doubts about it.

For Logstash publishing events to Kafka:

1) Do we need to explicitly define the partition in Logstash while
publishing to Kafka?
2) Will Kafka take care of properly distributing the data across the
partitions?

My impression is that, despite declaring the partitions while creating the
topic, the data from Logstash is being pushed to a single partition, or at
least is not getting uniformly distributed.

Looking for expert advice.

Thanks!

Re: Logstash to Kafka

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

In short, I don't see Kafka having problems with those numbers.  Logstash
will have a harder time, I believe.
That said, it all depends on how you tune things and what kind of / how much
hardware you use.

2B or 200B events, yes, big numbers, but how quickly do you need to process
those? in 1 minute, 1 hour, 1 day, or a week? :)
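
For a rough sense of scale (using the ~30 GB / 0.2 billion event figures
mentioned below, so treat this as an assumption rather than measured data):
0.2B events in ~30 GB is roughly 150 bytes per event, and at 100x that
volume you are looking at about 20B events (~3 TB). Draining that in a day
means sustaining on the order of 230,000 events/sec; with a week of budget
it drops to roughly 33,000 events/sec. That rate, not the absolute count,
is what decides how hard Logstash and Kafka have to work.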

SPM for Kafka (http://sematext.com/spm) will show you all possible Kafka
metrics you can imagine, so if you decide to give Kafka a try you'll be
able to tune Kafka with the help of SPM for Kafka charts and the help of
people on this mailing list.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Feb 5, 2015 at 2:12 PM, Vineet Mishra <cl...@gmail.com>
wrote:

> Yury,
>
> Thanks for sharing the insight into Kafka partition distribution.
>
> I am more concerned about the throughput that Kafka and Storm can deliver
> together for event processing.
>
> Currently I have a roughly 30 GB file with around 0.2 billion events, and
> this number will soon grow to 100 times the existing size.
>
> I was wondering whether the above-mentioned stream processing engine will
> be a good fit in my case?
> If yes, then with what configuration and tuning, so as to use resources
> effectively and maximize throughput?
>
> Thanks!
> On Feb 3, 2015 8:38 PM, "Yury Ruchin" <yu...@gmail.com> wrote:
>
> > This is a quote from Kafka documentation:
> > "The routing decision is influenced by the kafka.producer.Partitioner.
> >
> > interface Partitioner<T> {
> >    int partition(T key, int numPartitions);
> > }
> > The partition API uses the key and the number of available broker
> > partitions to return a partition id. This id is used as an index into a
> > sorted list of broker_ids and partitions to pick a broker partition for
> > the producer request. The default partitioning strategy is
> > hash(key)%numPartitions. If the key is null, then a random broker
> > partition is picked. A custom partitioning strategy can also be plugged
> > in using the partitioner.class config parameter."
> >
> > An important point for the null key is that the randomly chosen broker
> > partition sticks for the time specified by
> > "topic.metadata.refresh.interval.ms" which is 10 minutes by default. So
> > if you are using null key for Logstash entries, you will be writing to
> > the same partition for 10 minutes. Is this your case?
> >
> > 2015-02-03 14:03 GMT+03:00 Vineet Mishra <cl...@gmail.com>:
> >
> > > Hi,
> > >
> > > I have a setup where I am sniffing some logs (large ones, of course)
> > > through Logstash Forwarder and forwarding them to Logstash, which in
> > > turn publishes these events to Kafka.
> > >
> > > I have created the Kafka topic with the required number of partitions
> > > and replication factor, but I am not sure about the Logstash output
> > > configuration and have the following doubts about it.
> > >
> > > For Logstash publishing events to Kafka:
> > >
> > > 1) Do we need to explicitly define the partition in Logstash while
> > > publishing to Kafka?
> > > 2) Will Kafka take care of properly distributing the data across the
> > > partitions?
> > >
> > > My impression is that, despite declaring the partitions while creating
> > > the topic, the data from Logstash is being pushed to a single
> > > partition, or at least is not getting uniformly distributed.
> > >
> > > Looking for expert advice.
> > >
> > > Thanks!
> > >
> >
>

Re: Logstash to Kafka

Posted by Vineet Mishra <cl...@gmail.com>.
Yury,

Thanks for sharing the insight into Kafka partition distribution.

I am more concerned about the throughput that Kafka and Storm can deliver
together for event processing.

Currently I have a roughly 30 GB file with around 0.2 billion events, and
this number will soon grow to 100 times the existing size.

I was wondering whether the above-mentioned stream processing engine will
be a good fit in my case?
If yes, then with what configuration and tuning, so as to use resources
effectively and maximize throughput?

Thanks!
On Feb 3, 2015 8:38 PM, "Yury Ruchin" <yu...@gmail.com> wrote:

> This is a quote from Kafka documentation:
> "The routing decision is influenced by the kafka.producer.Partitioner.
>
> interface Partitioner<T> {
>    int partition(T key, int numPartitions);
> }
> The partition API uses the key and the number of available broker
> partitions to return a partition id. This id is used as an index into a
> sorted list of broker_ids and partitions to pick a broker partition for the
> producer request. The default partitioning strategy is
> hash(key)%numPartitions. If the key is null, then a random broker partition
> is picked. A custom partitioning strategy can also be plugged in using the
> partitioner.class config parameter."
>
> An important point for the null key is that the randomly chosen broker
> partition sticks for the time specified by "
> topic.metadata.refresh.interval.ms" which is 10 minutes by default. So if
> you are using null key for Logstash entries, you will be writing to the
> same partition for 10 minutes. Is this your case?
>
> 2015-02-03 14:03 GMT+03:00 Vineet Mishra <cl...@gmail.com>:
>
> > Hi,
> >
> > I have a setup where I am sniffing some logs (large ones, of course)
> > through Logstash Forwarder and forwarding them to Logstash, which in
> > turn publishes these events to Kafka.
> >
> > I have created the Kafka topic with the required number of partitions
> > and replication factor, but I am not sure about the Logstash output
> > configuration and have the following doubts about it.
> >
> > For Logstash publishing events to Kafka:
> >
> > 1) Do we need to explicitly define the partition in Logstash while
> > publishing to Kafka?
> > 2) Will Kafka take care of properly distributing the data across the
> > partitions?
> >
> > My impression is that, despite declaring the partitions while creating
> > the topic, the data from Logstash is being pushed to a single
> > partition, or at least is not getting uniformly distributed.
> >
> > Looking for expert advice.
> >
> > Thanks!
> >
>

Re: Logstash to Kafka

Posted by Yury Ruchin <yu...@gmail.com>.
This is a quote from Kafka documentation:
"The routing decision is influenced by the kafka.producer.Partitioner.

interface Partitioner<T> {
   int partition(T key, int numPartitions);
}
The partition API uses the key and the number of available broker
partitions to return a partition id. This id is used as an index into a
sorted list of broker_ids and partitions to pick a broker partition for the
producer request. The default partitioning strategy is
hash(key)%numPartitions. If the key is null, then a random broker partition
is picked. A custom partitioning strategy can also be plugged in using the
partitioner.class config parameter."

An important point for the null key is that the randomly chosen broker
partition sticks for the time specified by "
topic.metadata.refresh.interval.ms" which is 10 minutes by default. So if
you are using null key for Logstash entries, you will be writing to the
same partition for 10 minutes. Is this your case?
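
To make the default strategy concrete, here is a minimal sketch of a custom
partitioner written against the Partitioner interface exactly as quoted
above. The class name and key handling are my own illustration, not Kafka's
example; depending on the 0.8.x version the real interface is non-generic
and the implementation also needs a constructor taking VerifiableProperties,
while the newer Java clients use org.apache.kafka.clients.producer.Partitioner
instead. It would be wired in via the partitioner.class config parameter:

import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: implements the Partitioner<T> shape quoted above.
public class SimpleHashPartitioner<T> implements Partitioner<T> {
    @Override
    public int partition(T key, int numPartitions) {
        if (key == null) {
            // no key: pick a partition per message instead of sticking to one
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
        // same idea as the documented default, hash(key) % numPartitions;
        // masking off the sign bit keeps the result non-negative
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}

So as long as Logstash supplies a non-null, reasonably varied key (the
source host, for example), the default hash already spreads events across
partitions; with a null key you get the sticky 10-minute behaviour above.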

2015-02-03 14:03 GMT+03:00 Vineet Mishra <cl...@gmail.com>:

> Hi,
>
> I have a setup where I am sniffing some logs (large ones, of course)
> through Logstash Forwarder and forwarding them to Logstash, which in turn
> publishes these events to Kafka.
>
> I have created the Kafka topic with the required number of partitions and
> replication factor, but I am not sure about the Logstash output
> configuration and have the following doubts about it.
>
> For Logstash publishing events to Kafka:
>
> 1) Do we need to explicitly define the partition in Logstash while
> publishing to Kafka?
> 2) Will Kafka take care of properly distributing the data across the
> partitions?
>
> My impression is that, despite declaring the partitions while creating the
> topic, the data from Logstash is being pushed to a single partition, or at
> least is not getting uniformly distributed.
>
> Looking for expert advice.
>
> Thanks!
>