Posted to users@kafka.apache.org by Daniel Fagnan <da...@segment.com> on 2016/08/08 19:12:42 UTC

Large # of Topics/Partitions

Hey all,

I’m currently in the process of designing a system around Kafka and I’m wondering about the recommended way to manage topics. Each event stream we have needs to be isolated from the others. A failure in one should not prevent another event stream from processing (by failure, we mean a downstream failure that would require us to replay the messages).

So my first thought was to create a topic per event stream. This allows a larger event stream to be partitioned for added parallelism while keeping the default # of partitions as low as possible. This would solve the isolation requirement in that a topic can keep failing and we’ll continue replaying its messages without affecting all the other topics.
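
For concreteness, a rough sketch of the topic-per-stream idea using the Java AdminClient (the broker address, naming scheme, partition count, and replication factor below are just placeholders, not anything we’ve settled on):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class TopicPerStreamSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker

            try (AdminClient admin = AdminClient.create(props)) {
                // One topic per event stream; a busier stream could get more partitions.
                String streamId = "checkout-events"; // hypothetical stream name
                NewTopic topic = new NewTopic("stream." + streamId, 1, (short) 3);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }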

We’ve read that it’s not recommended to have your data model dictate the # of partitions or topics in Kafka, and we’re unsure about this approach if we need to triple our number of event streams.

We’re currently looking at 10,000 event streams (or topics), but we don’t want to be spinning up additional brokers just so we can add more event streams, especially if the load for each is reasonable.

Another option we were looking into was to not isolate at the topic/partition level but to keep a set of pending offsets persisted somewhere (seemingly similar to what Twitter Heron or Storm do, though they don’t seem to persist the pending offsets).
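
To make that concrete, a very rough sketch of what we mean by tracking pending offsets outside Kafka (the PendingStore interface, topic name, and downstream call are all hypothetical):

    import java.time.Duration;
    import java.util.Collections;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PendingOffsetSketch {
        // Stand-in for whatever external store would hold the pending offsets.
        interface PendingStore {
            void markPending(String topic, int partition, long offset);
            void markDone(String topic, int partition, long offset);
        }

        static void run(KafkaConsumer<String, String> consumer, PendingStore store) {
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    // Record the offset as pending before handing the message downstream,
                    // so a downstream failure leaves it persisted for later replay.
                    store.markPending(r.topic(), r.partition(), r.offset());
                    if (sendDownstream(r)) {
                        store.markDone(r.topic(), r.partition(), r.offset());
                    }
                }
                // Kafka's own offsets advance regardless; replay is driven by the store.
                consumer.commitSync();
            }
        }

        static boolean sendDownstream(ConsumerRecord<String, String> r) {
            return true; // stub for the real downstream call
        }
    }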

Thoughts?

Re: Large # of Topics/Partitions

Posted by Jens Rantil <je...@tink.se>.
Hi,

This might also be of interest: http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/

Cheers,
Jens

-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.rantil@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se


Re: Large # of Topics/Partitions

Posted by Daniel Fagnan <da...@segment.com>.
Thanks Tom! This was very helpful and I’ll explore having a more static set of partitions as that seems to fit Kafka a lot better.

Cheers,
Daniel


Re: Large # of Topics/Partitions

Posted by Tom Crayford <tc...@heroku.com>.
Hi Daniel,

Kafka doesn't provide this kind of isolation or scalability for many, many
streams. The usual design is to use a consistent hash of some "key" to
attribute your data to a particular partition. That, of course, doesn't
isolate things fully: everything sharing a partition remains dependent on
each other.
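
As a rough illustration, keyed production looks something like the sketch
below; with a non-null key the default partitioner hashes it to pick a
partition, so all events for one stream land together (the topic name and
key here are made up):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class KeyedProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The event-stream id is the record key; the default partitioner
                // hashes the key so one stream always maps to the same partition.
                String streamId = "stream-42"; // hypothetical stream identifier
                String payload = "{\"type\":\"example\"}";
                producer.send(new ProducerRecord<>("events", streamId, payload));
            }
        }
    }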

We've found that clusters hit a lot of issues somewhere between a few
thousand and a few tens of thousands of partitions (it depends on the write
pattern, how much memory you give the brokers and ZooKeeper, and whether
you plan on ever deleting topics).

Another option is to manage multiple clusters and keep each one under a
certain partition limit. That is, of course, additional operational
overhead and complexity.

I'm not sure I 100% understand your mechanism for tracking pending offsets,
but it seems like that might be your best option.

Thanks

Tom Crayford
Heroku Kafka
