Posted to users@kafka.apache.org by Eric Tschetter <ec...@gmail.com> on 2012/04/09 20:04:07 UTC

Number of feeds, how does it scale?

Hi guys,

I'm wondering about experiences with a large number of feeds created
and managed on a single Kafka cluster.  Specifically, if anyone can
share how many distinct feeds they run on their Kafka cluster and the
overall throughput they see, that'd be cool.

Some background: I'm planning on setting up a system around Kafka that
will (hopefully, eventually) have >10,000 feeds in parallel.  I expect
event volume on these feeds to follow a Zipfian distribution: a long
tail of smaller feeds and a few large ones, but with consumers attached
to every feed.  I'm trying to decide between relying on Kafka's feeds
(topics) to keep the data streams separate, or creating one large
aggregate feed and using Kafka's partitioning mechanisms along with
some custom logic to keep the feeds separated.  I'd prefer to use
Kafka's built-in feed mechanisms, because there are significant
benefits to that, but I can also imagine that this many feeds was
never among the base assumptions of how the system would be used,
which makes the performance questionable.
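
To make the aggregate-feed option concrete, here is a minimal sketch
of the kind of custom logic I mean.  It is plain Java with no Kafka
client calls, and every name in it is made up for illustration: a
stable feed-to-partition mapping, plus a framing scheme so a consumer
can demultiplex the aggregate stream back into per-feed streams.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    // Illustrative only -- not a Kafka API.  Routes events for many
    // feeds through one aggregate topic.
    public class FeedRouter {
        private final int numPartitions;

        public FeedRouter(int numPartitions) {
            this.numPartitions = numPartitions;
        }

        // Stable feed -> partition mapping, so all events for a feed
        // land in the same partition and per-feed ordering is kept.
        public int partitionFor(String feedId) {
            return (feedId.hashCode() & 0x7fffffff) % numPartitions;
        }

        // Frame each message as [id length][feed id][payload] so the
        // consumer can split the aggregate stream back into feeds.
        public byte[] encode(String feedId, byte[] payload) {
            byte[] id = feedId.getBytes(StandardCharsets.UTF_8);
            ByteBuffer buf =
                ByteBuffer.allocate(4 + id.length + payload.length);
            buf.putInt(id.length).put(id).put(payload);
            return buf.array();
        }
    }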

Any input is appreciated.

--Eric

Re: Number of feeds, how does it scale?

Posted by Jay Kreps <ja...@gmail.com>.
Hey Eric,

I think the most topics we have on a single cluster at LinkedIn is
around 300.  Our usage is closer to the "one big topic, partitioned by
some relevant key" model.  I am interested in working out any bugs with
larger numbers of topics, so if you hit anything we can probably help
work through it, but based on our own usage alone I can't guarantee you
won't see anything.

Things to consider:
- You probably don't want every topic on every machine, so that no
single machine needs to host 10k topics.  The ZK producer should
respect this, but there isn't a convenient "create topic" command line
tool to let you set the number of machines that host each topic.
- There might be issues around the amount of ZK metadata, for example
the thing Taylor mentions where we sequentially process topics...
- Might be good to run it once with hprof or equivalent enabled to
make sure we haven't done anything stupid internally (example
invocation after this list)
- On unclean shutdown we run recovery on the last segment of each log.
So if your log segment size is 100MB and you have 10k topics, that is
roughly 1TB of recovery, which will be a bit slow.  A quick hack fix is
to just set the segment size smaller (see the config sketch after this
list).  A better fix is for Kafka to periodically save out safe
recovery points.  We had a patch floating around somewhere to do this,
but I don't think we've gotten it into trunk...
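
For reference, here is roughly what those two suggestions could look
like.  Treat both as sketches: the hprof options are standard JDK hprof
agent flags, but the broker launch command, the property name, and the
default segment size should all be checked against your Kafka version.

    # CPU-sampling profile of the broker via the JDK's built-in hprof
    # agent (writes java.hprof.txt on exit; launch command illustrative)
    java -agentlib:hprof=cpu=samples,interval=20,depth=10 \
         kafka.Kafka config/server.properties

    # server.properties fragment: shrink the segment size so that
    # last-segment recovery after an unclean shutdown touches less
    # data per topic.  Property name/value are assumptions for a
    # 0.7-era broker.
    log.file.size=10485760   # 10MB segments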

-Jay

Re: Number of feeds, how does it scale?

Posted by Taylor Gautier <tg...@tagged.com>.
During startup, if I recall correctly, there is a validation sequence
when ZooKeeper is enabled: for each topic on disk, Kafka checks with
ZooKeeper that the topic is still valid.  From my recollection this
check takes a small amount of time per topic and is done serially, so
with 5-10k topics startup took several minutes, versus several seconds
with ZooKeeper disabled.  (At even a rough ~20ms per serial ZooKeeper
round trip, 10k topics is already over three minutes.)

Before we implemented the runtime cleaner we had to shut down Kafka,
manually wipe the topics that were not in use, and then start it back
up, so startup time was critical.

At this point we could turn ZooKeeper back on, but since each Kafka
instance runs in isolation from the others, there isn't any practical
need for it.  In fact it would just complicate the startup procedure
for each machine, since we'd have to start ZooKeeper before Kafka.

In all honesty, we were in a mad scramble just to release our
implementation and make it work, so we didn't work hard to make things
ideal.  Since fixing the last issues sometime in early January it has
been running with practically zero effort on our part, so there's been
little incentive to go in and fix or change anything (or reconsider any
decisions we made) - aka don't fix what ain't broke.

We have several offshoot kafka use cases coming down the pike, so this
decision as well as others may get some time for reconsideration.

On Mon, Apr 9, 2012 at 12:53 PM, Neha Narkhede <ne...@gmail.com> wrote:

> I'm wondering if you could share why you needed to turn off ZooKeeper
> registration on the Kafka brokers.

Re: Number of feeds, how does it scale?

Posted by Neha Narkhede <ne...@gmail.com>.
Taylor,

Thanks for sharing the details of your Kafka deployment!  I'm wondering
if you could share why you needed to turn off ZooKeeper registration
on the Kafka brokers.  Currently, the Kafka broker creates a node in
ZK when

1. it starts up
2. it receives a message for a new topic

>> Note that we will continue with our current sharded solution, because
> adding ZooKeeper back in would likely cause a lot of bottlenecks, not
> just with the startup (all topics are stored in a flat hierarchy in
> ZooKeeper too, so it's likely this might not do well at high scale)

ZooKeeper can handle tens of thousands of writes/sec.  So are you
saying that topics get created at such a high rate that you get close
to the maximum write throughput that ZooKeeper can provide?

Thanks,
Neha

Re: Number of feeds, how does it scale?

Posted by Taylor Gautier <tg...@tagged.com>.
We've worked on this problem at Tagged.  In the current implementation
a Kafka cluster can handle around 20k simultaneous topics.  This is on
RHEL 5 machines with an EXT3-backed filesystem, and the bottleneck is
mostly the filesystem itself, because Kafka stores all topics in a
single directory.

We have implemented a fast cleaner that deletes topics that are no
longer used.  This cleaner really wipes the topic, instead of leaving a
file/directory around as the normal cleaner does.  Since this resets
the topic offset back to 0, you have to make sure your clients can
handle it.  (In theory your clients already need to handle this
situation, since it is a normal part of Kafka should the topic offset
wrap around the 64-bit value, though in practice that probably never
happens.)
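
Concretely, the client-side handling amounts to: when a fetch comes
back out of range, throw away the saved offset and restart from the
earliest offset the broker still has.  Here is a sketch of that logic
against a made-up fetch interface; map it onto whichever consumer API
you actually use:

    // Illustrative only: FetchClient is not a real Kafka API.
    interface FetchClient {
        byte[][] fetch(String topic, long offset)
            throws OffsetOutOfRangeException;
        long earliestOffset(String topic); // first offset still on broker
    }

    class OffsetOutOfRangeException extends Exception {}

    class ResilientReader {
        private final FetchClient client;
        private long offset; // advancing it after a fetch is omitted here

        ResilientReader(FetchClient client, long startOffset) {
            this.client = client;
            this.offset = startOffset;
        }

        byte[][] poll(String topic) throws OffsetOutOfRangeException {
            try {
                return client.fetch(topic, offset);
            } catch (OffsetOutOfRangeException e) {
                // Topic was wiped (or wrapped): the saved offset no
                // longer exists, so restart from the earliest one.
                offset = client.earliestOffset(topic);
                return client.fetch(topic, offset);
            }
        }
    }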

The other thing of note is that with this number of topics, having
ZooKeeper turned on makes startup intolerably slow, as Kafka goes
through a per-topic validation process.  We have therefore turned
ZooKeeper off as well.

To allow us to handle a high number of topics in this configuration,
we have written a custom sharding/routing layer on top of Kafka.  The
layers look like this:

+--------------------------------------------------------------------------+
| Pub/sub (implements custom sharding and shields from specifics of Kafka) |
+--------------------------------------------------------------------------+
| Kafka                                                                    |
+--------------------------------------------------------------------------+
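
The routing in that top layer can be as simple as hashing the topic
name to pick a broker.  A minimal sketch of the idea (illustrative
names, not our actual code):

    import java.util.List;

    // Illustrative sketch of the sharding layer: hash the topic name
    // to pick which standalone Kafka broker owns that topic.
    class TopicShardRouter {
        private final List<String> brokers; // e.g. "kafka01:9092", ...

        TopicShardRouter(List<String> brokers) {
            this.brokers = brokers;
        }

        String brokerFor(String topic) {
            int shard = (topic.hashCode() & 0x7fffffff) % brokers.size();
            return brokers.get(shard);
        }
    }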

Currently we are handling on the order of 100k concurrent topics.  We
plan to 10x that in the near future.  To do so we are planning on
implementing the following:

   - SSD-backed disks (vs. the current eSATA-backed machines)
   - a topic hierarchy (to alleviate the 20k single-directory
     bottleneck; sketched after this list)
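
By "topic hierarchy" we mean fanning the per-topic directories out
across nested subdirectories, so no single directory ever holds tens
of thousands of entries.  Roughly this shape (a hypothetical layout,
not an existing Kafka feature):

    import java.io.File;

    // Hypothetical on-disk layout: two levels of 256-way fanout keyed
    // by the topic name's hash, e.g. logs/3f/a2/mytopic, so EXT3 never
    // sees one directory with 100k+ entries.
    class TopicDirLayout {
        private final File logRoot;

        TopicDirLayout(File logRoot) {
            this.logRoot = logRoot;
        }

        File dirFor(String topic) {
            int h = topic.hashCode() & 0x7fffffff;
            String l1 = String.format("%02x", (h >>> 8) & 0xff);
            String l2 = String.format("%02x", h & 0xff);
            return new File(new File(new File(logRoot, l1), l2), topic);
        }
    }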

Note that we will continue with our current sharded solution, because
adding ZooKeeper back in would likely cause a lot of bottlenecks, not
just with the startup (all topics are stored in a flat hierarchy in
ZooKeeper too, so it's likely this might not do well at high scale).

Hope that helps!
