Posted to users@kafka.apache.org by Guillermo Ortiz <ko...@gmail.com> on 2015/12/01 13:31:19 UTC

Number of partitions and disks in a topic

Hello,

I want to size a Kafka cluster with just one topic, and I'm going to
process the data with Spark and other applications.

If I have six hard drives per node, is Kafka smart enough to deal with
them? I guess that memory is very important here and that data is cached
in memory. Is it possible to configure Kafka to use several directories,
as HDFS does, each one on a different disk?

I'm not sure about the number of partitions either. I have read
http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
and they talk about numbers of partitions much higher than I had thought.
Is it normal to have a topic with 1000 partitions? I was thinking about
two to four partitions per node. Is my thinking wrong?

As I'm going to process the data with Spark, I could have the number of
partitions equal to the maximum number of Spark executors, always thinking
ahead and sizing somewhat higher than that.
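(If I understand the Spark direct Kafka stream correctly, each Kafka
partition becomes one Spark partition when reading, so the topic's
partition count would cap the parallelism of the read stage anyway.)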

Re: Number of partitions and disks in a topic

Posted by Todd Palino <tp...@gmail.com>.
Getting the partitioning right now is only important if your messages are
keyed. If they’re not, stop reading, start with a fairly low number of
partitions, and expand as needed.
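For unkeyed data the later expansion is cheap. As a rough sketch (the
topic name, counts, and ZooKeeper address are only placeholders), you
might start small and grow the topic later with something like:

    kafka-topics.sh --zookeeper zkhost:2181 --create \
      --topic my-topic --partitions 8 --replication-factor 3

    kafka-topics.sh --zookeeper zkhost:2181 --alter \
      --topic my-topic --partitions 16

Adding partitions changes which partition a given key hashes to, which is
exactly why getting the count right up front only matters for keyed
messages.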

1000 partitions per topic is generally not normal. It’s not really a
problem, but you want to size topics appropriately. Every partition
represents open file handles and overhead on the cluster controller. But if
you’re working with keyed messages, size for your eventual data size. We
use a general guideline of keeping partitions on disk under 25 GB (for 4
days of retention - so ~6 GB of compressed messages per day). We find this
gives us a good spread of data in the cluster, and represents a reasonable
amount of network throughput per partition, so it allows us to scale
easily. It also makes for fewer issues with replication within the cluster,
and mirroring to other clusters.
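To put made-up numbers on that: a topic taking in roughly 600 GB of
compressed messages per day, retained for 4 days, holds about 2.4 TB, so
you'd want on the order of 100 partitions to keep each one under 25 GB.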

Outside of a guideline like that, partition based on how you want to spread
out your keys. We have a user who wanted 720 partitions for a given topic
because 720 has a large number of factors, which allows them to run a
variety of consumer counts and still have balanced load.
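(720 divides evenly by 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24,
30, 36, 40, 45, 48, 60, 72, 80, 90, 120, 144, 180, 240, and 360, so a
consumer group of any of those sizes gets the same number of partitions
per consumer.)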

As far as multiple disks go, yes, Kafka can make use of multiple log
dirs. However, there are caveats. It’s fairly naive about how it assigns
partitions to disks, and partitions are assigned by the controller to a
broker with no knowledge of the disks underneath. The broker then makes the
assignment to a single disk. In addition, there’s no tool for moving
partitions from one mount point to another without shutting down the broker
and doing it manually.
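For reference, the multi-directory support is just the log.dirs broker
setting, a comma-separated list of paths (the mount points below are only
placeholders, one entry per disk):

    # server.properties
    log.dirs=/data/d1/kafka-logs,/data/d2/kafka-logs,/data/d3/kafka-logs,/data/d4/kafka-logs,/data/d5/kafka-logs,/data/d6/kafka-logs

If I remember right, the broker simply places each new partition in
whichever directory currently holds the fewest partitions, with no regard
to size or I/O load, which is the naive behavior described above.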

-Todd

-- 
Todd Palino
Staff Site Reliability Engineer
Data Infrastructure Streaming



linkedin.com/in/toddpalino