Posted to users@kafka.apache.org by Aniket Bhatnagar <an...@gmail.com> on 2013/10/04 11:15:11 UTC

Is 30 too high a partition number?

I am using Kafka as a buffer for data streaming in from various sources.
Since it is time-series data, I generate the message key by combining the
source ID and the minute of the timestamp. This means I can have at most 60
partitions per topic (as each source has its own topic). I have set
num.partitions to 30 (60/2) for each topic in the broker config. I don't
have a very good reason for picking 30 as the default number of partitions
per topic, but I wanted it to be a high number so that I can achieve high
parallelism during in-stream processing. I am worried that a high number
like 30 (the default configuration had it as 2) can negatively impact Kafka
performance in terms of message throughput or memory consumption. I
understand that this can lead to many log files on disk, but I am thinking
of dealing with that by having multiple directories on the same disk if I
run into issues.
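
To make the scheme concrete, the key is built roughly like this (a minimal
Java sketch with made-up names, not my actual code):

    import java.time.Instant;
    import java.time.ZoneOffset;

    public class MinuteKey {
        // Key = source ID + minute of the hour (0-59). Within a per-source
        // topic the source ID is constant, so there are at most 60 distinct
        // keys, and hash partitioning can spread them over at most 60
        // partitions regardless of how high num.partitions is set (30 here).
        static String messageKey(String sourceId, long epochMillis) {
            int minute = Instant.ofEpochMilli(epochMillis)
                                .atZone(ZoneOffset.UTC)
                                .getMinute();
            return sourceId + "-" + minute;   // e.g. "source42-17"
        }

        public static void main(String[] args) {
            System.out.println(messageKey("source42", System.currentTimeMillis()));
        }
    }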

My question to the community: am I prematurely optimizing the partition
count, given that right now even 5 partitions seems sufficient, and will I
run into unwanted issues because of it? Or is 30 an OK number of partitions
to use?

Re: Is 30 too high a partition number?

Posted by Philip O'Toole <ph...@loggly.com>.
I would like to second that. It would be really useful.

Philip

On Oct 8, 2013, at 9:31 AM, Jason Rosenberg <jb...@squareup.com> wrote:

> What I would like to see is a way for inactive topics to automatically get
> removed after they are inactive for a period of time.  That might help in
> this case.

Re: Is 30 too high a partition number?

Posted by Jason Rosenberg <jb...@squareup.com>.
What I would like to see is a way for inactive topics to automatically get
removed after they are inactive for a period of time.  That might help in
this case.

I added a comment to this larger jira:
https://issues.apache.org/jira/browse/KAFKA-330

Perhaps it should really be its own JIRA entry.

Jason


Re: Is 30 too high a partition number?

Posted by Aniket Bhatnagar <an...@gmail.com>.
Thanks, Neha. Is it worthwhile to investigate an option to store topic
metadata (partitions, etc.) in another consistent data store (MySQL, HBase,
etc.)? Should we make this feature pluggable?

The reason I am thinking we may need to go beyond the 2000 total partition
limit is that there may be genuine use cases for a high number of topics.
For example, in my particular case, I am using Kafka as a buffer to store
data arriving from various sensors deployed in the physical world. These
sensors may be short-lived or long-lived. I was thinking of having an
individual topic for each sensor. This way, if a badly behaving sensor
pushes data at a much faster rate than we can process it as Kafka
consumers, we will eventually overflow and start losing data for that
particular sensor only. However, we can still potentially continue to
process data from the other sensors that are pushing data at a manageable
rate. If I go with one topic for all the sensors, one misbehaving sensor
can potentially prevent us from catching up with the topic within the
retention period, thus making us lose data from all sensors.
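
In other words, something along these lines (a hypothetical sketch; I am
using the newer Java producer client and made-up topic/sensor names purely
for illustration):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SensorIngest {
        // One topic per sensor: a backlog on one sensor's topic only ages
        // out that sensor's data once retention kicks in, without holding
        // back consumption of the other sensors' topics.
        static String topicFor(String sensorId) {
            return "sensor." + sensorId;   // e.g. "sensor.a1b2c3"
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>(topicFor("a1b2c3"),
                                                   "a1b2c3-17", "reading=42"));
            }
        }
    }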

The other issue is that if we go with a topic per sensor and the sensors
are short-lived, then once we have reached the threshold of 2000 sensors
already deployed, Kafka will stop working (because of the Zookeeper
limitation) even though the previously deployed sensors may no longer be
active at all.

I am sure there may be other genuine use cases for having many more than
2000 topics.


Re: Is 30 too high a partition number?

Posted by Neha Narkhede <ne...@gmail.com>.
You probably want to think of this in terms of the number of partitions on
a single broker, instead of per topic, since I/O is the limiting factor in
this case. Another factor to consider is the total number of partitions in
the cluster, as Zookeeper becomes the limiting factor there. 30 partitions
is not too large, provided the total number of partitions doesn't exceed
roughly a couple thousand. To give you an example, some of our clusters are
16 nodes, and some of the topics on those clusters have 30 partitions.
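
As a rough back-of-the-envelope illustration of those two views (the
numbers here are made up, not from our clusters):

    public class PartitionBudget {
        public static void main(String[] args) {
            int topics = 60;                  // hypothetical numbers
            int partitionsPerTopic = 30;
            int replicationFactor = 2;
            int brokers = 16;

            // Zookeeper cares about the cluster-wide partition count.
            int totalPartitions = topics * partitionsPerTopic;    // 1800, under ~2000

            // Each broker's I/O load depends on the replicas it hosts.
            int replicasPerBroker =
                totalPartitions * replicationFactor / brokers;    // ~225 per broker

            System.out.println("total partitions: " + totalPartitions);
            System.out.println("partition replicas per broker: " + replicasPerBroker);
        }
    }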

Thanks,
Neha