Posted to dev@kafka.apache.org by Bhavesh Mistry <mi...@gmail.com> on 2014/05/23 21:49:39 UTC

Topic Partitioning Strategy For Large Data

Hi Kafka Users,



We are trying to transport 4 TB of data per day on a single topic.  It is
operational application logs.  How do we estimate the number of partitions
and choose a partitioning strategy?  Our goal is to drain the Kafka brokers
(from the consumer side) as soon as messages arrive (keeping the lag as low
as possible), and we would also like to distribute the logs uniformly across
all partitions.



Here is our broker hardware spec:

3-broker cluster (192 GB RAM and 32 cores each, with SSDs sized to hold 7
days of data) and 100G NICs



Data rate: ~13 GB per minute





Is there a formula to compute the optimal number of partitions needed?  Also,
how do we ensure uniform distribution from the producer side?  (We currently
use counter % numPartitions, which is not a viable solution in a production
environment.)
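For illustration only, here is a minimal sketch of a thread-safe round-robin
chooser; RoundRobinChooser is a hypothetical helper, not a Kafka class, and it
would still need to be plugged into whatever partitioner hook your producer
version exposes:

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not a Kafka API: cycles through partitions so that
// messages are spread uniformly even when many producer threads share it.
public class RoundRobinChooser {
    private final AtomicInteger counter = new AtomicInteger(0);

    // Returns the next partition index in [0, numPartitions).
    public int nextPartition(int numPartitions) {
        int next = counter.getAndIncrement();
        // Mask off the sign bit so the index stays non-negative after the
        // counter eventually overflows Integer.MAX_VALUE.
        return (next & 0x7fffffff) % numPartitions;
    }
}

A per-process counter like this avoids any coordination across producer hosts;
each producer simply cycles through all partitions on its own.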



Thanks,
Bhavesh

Re: Topic Partitioning Strategy For Large Data

Posted by Drew Goya <dr...@gradientx.com>.
A few things I've learned:

1) Don't break things up into separate topics unless the data in them is
truly independent.  Consumer behavior can be extremely variable; don't
assume you will always be consuming as fast as you are producing.

2) Keep time-related messages in the same partition.  Again, consumer
behavior can (and will) be extremely variable; don't assume the lag on all
your partitions will be similar.  Design a partitioning scheme so that the
owner of one partition can stop consuming for a long period of time and
your application will be minimally impacted (for example, partitioning by
transaction ID, as sketched below).
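For illustration, a minimal sketch of partitioning by transaction ID
(TransactionIdPartitioner and the transactionId field are hypothetical; how
you register a custom partitioner depends on the producer version in use):

// Hypothetical helper, not a Kafka API: maps a transaction ID to a stable
// partition so every message for that transaction lands on the same
// partition and keeps its relative order there.
public class TransactionIdPartitioner {
    public int partitionFor(String transactionId, int numPartitions) {
        // Mask off the sign bit so the result is non-negative even when
        // hashCode() is negative.
        return (transactionId.hashCode() & 0x7fffffff) % numPartitions;
    }
}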


On Fri, May 23, 2014 at 1:12 PM, Joel Koshy <jj...@gmail.com> wrote:

> Take a look at:
>
> https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowdoIchoosethenumberofpartitionsforatopic
> ?
>
> On Fri, May 23, 2014 at 12:49:39PM -0700, Bhavesh Mistry wrote:
> > Hi Kafka Users,
> >
> >
> >
> > We are trying to transport 4TB data per day on single topic.  It is
> > operation application logs.    How do we estimate number of partitions and
> > partitioning strategy?   Our goal is to drain (from consumer side) from
> > the Kafka Brokers as soon as messages arrive (keep the lag as minimum as
> > possible) and also we would like to uniformly distribute the logs across
> > all partitions.
> >
> >
> >
> > Here is our Brokers HW Spec:
> >
> > 3 Broker Cluster (192 GB RAM, 32 Cores each with SSD to hold 7 days of data
> > ) with 100G NIC
> >
> >
> >
> > Data Rate :    ~ 13 GB per minute
> >
> >
> >
> >
> >
> > Is there a formula to compute optimal number of partition need  ?  Also,  how
> > to ensure uniform distribution from the producer side  (currently we have
> > counter % numPartitions  which is not viable solution in prod env)
> >
> >
> >
> > Thanks,
> > Bhavesh
>
>


Re: Topic Partitioning Strategy For Large Data

Posted by Joel Koshy <jj...@gmail.com>.
Take a look at:
https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowdoIchoosethenumberofpartitionsforatopic?
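As a rough back-of-the-envelope application of that FAQ's throughput-based
sizing (the per-partition consumer rate below is an assumption to be measured,
not a Kafka constant): ~13 GB per minute is roughly 220 MB/s, so if one
consumer thread can sustain about 20 MB/s you would need at least ~11
partitions just to keep up, and in practice more for headroom and future
consumer parallelism.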

On Fri, May 23, 2014 at 12:49:39PM -0700, Bhavesh Mistry wrote:
> Hi Kafka Users,
> 
> 
> 
> We are trying to transport 4TB data per day on single topic.  It is
> operation application logs.    How do we estimate number of partitions and
> partitioning strategy?   Our goal is to drain (from consumer side) from
> the Kafka Brokers as soon as messages arrive (keep the lag as minimum as
> possible) and also we would like to uniformly distribute the logs across
> all partitions.
> 
> 
> 
> Here is our Brokers HW Spec:
> 
> 3 Broker Cluster (192 GB RAM, 32 Cores each with SSD to hold 7 days of data
> ) with 100G NIC
> 
> 
> 
> Data Rate :    ~ 13 GB per minute
> 
> 
> 
> 
> 
> Is there a formula to compute optimal number of partition need  ?  Also,  how
> to ensure uniform distribution from the producer side  (currently we have
> counter % numPartitions  which is not viable solution in prod env)
> 
> 
> 
> Thanks,
> Bhavesh

