Posted to users@kafka.apache.org by Roman Iakovlev <ro...@live.com> on 2014/08/08 21:35:31 UTC

Architecture: amount of partitions

Dear all,

 

I'm new to Kafka, and I'm considering using it for a somewhat unusual
purpose. I want it to be a backend for data synchronization between a
multitude of devices which are not always online (mobile and embedded
devices). All the synchronized information belongs to some user and can be
identified by the user id. There are several data types, and a user can have
many entries of each data type coming from many different devices.

 

This solution has to scale up to hundreds of thousands of users, and, as far
as I understand, Kafka stores every partition in a single file. I've been
thinking about creating a topic for every data type and a separate partition
for every user. The amount of data stored per user is no more than several
megabytes over the whole lifetime, because the data stored would be keyed
messages, and I'm expecting them to be compacted.

 

So what I'm wondering is: would Kafka be the right approach for such a task,
and if yes, would this architecture (one topic per data type and one partition
per user) scale to the specified extent?

 

Thanks, 

Roman.


Re: Architecture: amount of partitions

Posted by Jonathan Weeks <jo...@gmail.com>.
The approach may well depend on your deployment horizon. Currently the offset tracking for each partition is done in ZooKeeper, which places an upper limit on the number of topics/partitions you can have and still operate with any kind of efficiency.

In 0.8.2, hopefully coming in the next month or two, consumer offset tracking is done internally via a Kafka topic rather than in ZooKeeper, so the partition-count scalability issue above isn't as severe.

From the broker side, some filesystems such as XFS have no problem with hundreds of thousands of files in a directory. My experience with ext3/ext4 and lots of files has been less happy.

Also, I’m not sure about your retention policy needs for messages in the broker (usually 7 days by default). Using Kafka as a long term DB probably isn’t a great fit.

Another approach to consider is to store users in fewer topics and differentiate based on a message key which contains the user id, for example.
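
A minimal sketch of that keying approach with the 0.8 Java producer API; the broker list, topic name, payload, and serializer choices below are assumptions, not from this thread:

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class UserSyncProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // assumed brokers
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("key.serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", "1");

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));

        // One topic per data type, keyed by user id: the default partitioner hashes
        // the key, so all of a user's messages land in the same partition, in order.
        String userId = "user-12345";
        String payload = "{\"device\":\"phone-1\",\"entry\":\"...\"}";
        producer.send(new KeyedMessage<String, String>("contacts", userId, payload));

        producer.close();
    }
}

One caveat if the topic is also configured for log compaction (cleanup.policy=compact): compaction keeps only the latest message per key, so a key of just the user id would eventually leave a single record per user; the key would need to identify the individual entry as well.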

Best Regards,

-JW


Re: Architecture: amount of partitions

Posted by Guozhang Wang <wa...@gmail.com>.
Kane,

The built-in offset management is already in the master branch and will be
included in 0.8.2. For now you can give the current trunk a spin.
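
If you want to try it, here is a minimal sketch of pointing the high-level consumer at Kafka-based offset storage; the group name, topic, and ZooKeeper address are placeholders, and the offsets.storage / dual.commit.enabled keys are the 0.8.2 consumer configs as I understand them:

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class KafkaOffsetStorageConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");   // assumed ZooKeeper address
        props.put("group.id", "device-sync");         // assumed consumer group
        props.put("offsets.storage", "kafka");        // commit offsets to the internal offsets topic
        props.put("dual.commit.enabled", "false");    // skip the parallel commit to ZooKeeper

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("contacts", 1));

        // Iterate the single stream of the assumed "contacts" topic.
        for (MessageAndMetadata<byte[], byte[]> record : streams.get("contacts").get(0)) {
            System.out.printf("partition=%d offset=%d value=%s%n",
                    record.partition(), record.offset(), new String(record.message()));
        }
    }
}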

Guozhang


-- 
-- Guozhang

Re: Architecture: amount of partitions

Posted by Kane Kane <ka...@gmail.com>.
Hello Guozhang,

Is storing offsets in a Kafka topic already in the master branch?
We would like to use that feature; when do you plan to release 0.8.2?
Can we use the master branch in the meantime (i.e., is it stable enough)?

Thanks.


Re: Architecture: amount of partitions

Posted by Guozhang Wang <wa...@gmail.com>.
Hi Roman,

Kafka's current messaging guarantee is at-least-once, and we are working on
transactional messaging features to make it exactly-once. After that, we expect
it to be used as a synchronization/replication layer for storage systems with
use cases like yours.

As for your design: since you will probably have a lot of users and each
user's data is small, you would end up with many small files on Kafka. If
all you want is order preservation per user, you can probably just use
keyed messages with the user id as the key; that way all messages with the
same key end up in the same partition and hence are consumed by the same
consumer client. With that you only need a fixed, small number of partitions.
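
To make the "same key, same partition" point concrete, here is a tiny sketch that mirrors what the default partitioner does; the partition count is just an assumed example:

public final class KeyPartitioning {

    // Hash the key and take it modulo the partition count, the same idea the
    // default partitioner uses, so a given user id always maps to one partition.
    public static int partitionFor(String userId, int numPartitions) {
        return (userId.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 50; // assumed fixed, small partition count
        System.out.println(partitionFor("user-12345", numPartitions)); // same output every run
    }
}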

Guozhang


-- 
-- Guozhang