Posted to users@kafka.apache.org by Ravindranath Akila <ra...@gmail.com> on 2013/10/06 05:18:51 UTC

Managing Millions of Partitions in Kafka

Initially, I thought dynamic topic creation could be used to maintain per-user
data on Kafka. Then I read that partitions can and should be used for this
instead.

If a partition is to be used to map a user, can there be a million, or even a
billion, partitions in a cluster? How does one go about designing such a
model?

Can the replication tool be used to assign, say, partitions 1-10,000 to
replica 1 and partitions 10,001-20,000 to replica 2?

If not, since there is a ulimit on the file system, should one model it based
on a replica/topic/partition approach? Say users 1-10,000 go on topic 10k-1,
which has 10,000 partitions, and users 10,001-20,000 go on topic 10k-2, which
also has 10,000 partitions.

Simply put, how can a million stateful data points be handled? (I deduced
that a user-id-to-partition-number mapping can be done via a partitioner, but
unless a server can be configured to handle only a given set of partitions,
with a range-based notation, it is almost impossible to handle a large
dataset. Is it that Kafka can only handle a limited set of stateful data
right now?)
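
For concreteness, a minimal sketch of the kind of partitioner I mean, assuming
the 0.8-era producer API (the class name and hashing scheme here are just
illustrative):

    import kafka.producer.Partitioner;
    import kafka.utils.VerifiableProperties;

    // Illustrative sketch: hash a user id onto a fixed set of partitions.
    public class UserIdPartitioner implements Partitioner {

        // The 0.8 producer instantiates partitioner.class reflectively
        // and expects a constructor taking VerifiableProperties.
        public UserIdPartitioner(VerifiableProperties props) {
        }

        @Override
        public int partition(Object key, int numPartitions) {
            // Mask the sign bit so the result is never negative,
            // even when hashCode() returns Integer.MIN_VALUE.
            return (key.hashCode() & 0x7fffffff) % numPartitions;
        }
    }

The catch is that this maps many users onto each partition rather than giving
each user a partition of their own.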

http://stackoverflow.com/questions/17205561/data-modeling-with-kafka-topics-and-partitions

Btw, why does Kafka have to keep each partition open? Can't a partition be
opened for read/write only when needed?

Thanks in advance!

Re: Managing Millions of Partitions in Kafka

Posted by Benjamin Black <b...@b3k.us>.
Ha ha, yes, exactly, you need a database. Kafka is a wonderful tool, but
not the right one for a job like that.



Re: Managing Millions of Partitions in Kafka

Posted by Ravindranath Akila <ra...@gmail.com>.
Actually, we need a broker, but a more stateful one. Hence the decision to
use a TTL on HBase.
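
As a rough sketch, assuming the old HBase Java admin API, creating a table
whose column family expires cells after seven days and keeps three versions
looks something like this (table and family names are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateTtlTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // Illustrative table and column family names.
            HTableDescriptor table = new HTableDescriptor("user_events");
            HColumnDescriptor family = new HColumnDescriptor("d");
            family.setTimeToLive(7 * 24 * 60 * 60); // expire cells after 7 days
            family.setMaxVersions(3);               // keep up to 3 versions per cell
            table.addFamily(family);

            admin.createTable(table);
            admin.close();
        }
    }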

Re: Managing Millions of Partitions in Kafka

Posted by Benjamin Black <b...@b3k.us>.
What you are discovering is that Kafka is a message broker, not a database.



Re: Managing Millions of Partitions in Kafka

Posted by Ravindranath Akila <ra...@gmail.com>.
Thanks a lot Neha!

Actually, using keyed messages (with Simple Consumers) was the approach we
took. But it seems we can't map each user to a new partition due to
ZooKeeper limitations. Instead, we will have to map a "group" of users onto
one partition, and then we can't fetch the messages for only one user.
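
To illustrate the problem, here is a rough sketch against the 0.8
SimpleConsumer API (host, topic, partition, offset, and key are all
illustrative): the broker hands back every message in the partition, so
narrowing down to a single user's key has to happen on the client.

    import java.nio.ByteBuffer;
    import kafka.api.FetchRequest;
    import kafka.api.FetchRequestBuilder;
    import kafka.javaapi.FetchResponse;
    import kafka.javaapi.consumer.SimpleConsumer;
    import kafka.message.MessageAndOffset;

    public class PerUserFetchSketch {
        public static void main(String[] args) throws Exception {
            SimpleConsumer consumer =
                new SimpleConsumer("localhost", 9092, 100000, 64 * 1024, "perUserFetch");

            FetchRequest request = new FetchRequestBuilder()
                .clientId("perUserFetch")
                .addFetch("user-events", 0, 0L, 100000)
                .build();
            FetchResponse response = consumer.fetch(request);

            // The fetch returns the whole partition; filter by key client-side.
            for (MessageAndOffset mo : response.messageSet("user-events", 0)) {
                ByteBuffer keyBuffer = mo.message().key();
                if (keyBuffer == null) continue;
                byte[] keyBytes = new byte[keyBuffer.remaining()];
                keyBuffer.get(keyBytes);
                if ("user-42".equals(new String(keyBytes, "UTF-8"))) {
                    // Only now do we know this message belongs to the user we want.
                }
            }
            consumer.close();
        }
    }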

It seems our data is best put in HBase with a TTL and versioning.

Thanks!

R. A.

Re: Managing Millions of Partitions in Kafka

Posted by Neha Narkhede <ne...@gmail.com>.
Kafka is designed to have on the order of a few thousand partitions, roughly
less than 10,000, and the main bottleneck is ZooKeeper. A better way to
design such a system is to have fewer partitions and use keyed messages to
distribute the data over a fixed set of partitions.
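
As a minimal sketch of that, assuming the 0.8 producer API (broker list,
topic, and key are illustrative): every message for a given user carries the
same key, so the default hash-based partitioner keeps that user's data on one
partition of a topic with a fixed partition count.

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class KeyedProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "localhost:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));

            // Same key => same partition, so one user's messages stay ordered
            // while the total partition count stays small and fixed.
            producer.send(new KeyedMessage<String, String>(
                "user-events", "user-42", "some payload"));
            producer.close();
        }
    }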

Thanks,
Neha