
Posted to users@kafka.apache.org by Victoria Zuberman <vi...@imperva.com> on 2020/07/07 05:52:10 UTC

Keys and partitions

Hi,

I have userId as a key.
Many users have moderate amounts of data, but some users have more and some have a huge amount of data.

I have been thinking about the following aspects of partitioning:

1. If two or more large users fall into the same partition, I might end up with large partitions (unbalanced relative to the other partitions).
2. If smaller users fall into the same partition as a huge user, the small users might get slower processing because of the amount of data the huge user has.
3. If the order of the messages is not critical, I might want to allow several consumers to work on the data of the same huge user, so I would like to spread one userId across several partitions.
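
To make points 1 and 2 concrete, here is a minimal simulation sketch. It is not part of the original message: the workload numbers and the names `partition_for` and `partition_load` are hypothetical, and CRC32 is only a stand-in for key hashing (Kafka's own partitioner uses a murmur2 hash, so real assignments will differ).

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    # Stand-in for key hashing. Kafka hashes keys with murmur2, not CRC32,
    # so this only illustrates the idea of "hash(key) mod partition count".
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_load(records_per_user, num_partitions):
    # Sum the record counts of all users that land in each partition.
    load = Counter({p: 0 for p in range(num_partitions)})
    for user_id, record_count in records_per_user.items():
        load[partition_for(user_id, num_partitions)] += record_count
    return load


# Hypothetical workload: 1000 moderate users and two huge ones.
users = {f"user-{i}": 100 for i in range(1000)}
users["huge-a"] = 50_000
users["huge-b"] = 80_000

load = partition_load(users, num_partitions=12)
for partition in sorted(load):
    print(partition, load[partition])
```

Whichever partitions the two huge users hash to carry far more records than the rest, and every small user sharing those partitions queues behind them.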

I have some ideas on how to partition to solve those issues, but if you have something that worked well for you in production I would love to hear about it.
Also, any links to relevant blog posts etc. would be welcome.

Thanks,
Victoria
-------------------------------------------
NOTICE:
This email and all attachments are confidential, may be proprietary, and may be privileged or otherwise protected from disclosure. They are intended solely for the individual or entity to whom the email is addressed. However, mistakes sometimes happen in addressing emails. If you believe that you are not an intended recipient, please stop reading immediately. Do not copy, forward, or rely on the contents in any way. Notify the sender and/or Imperva, Inc. by telephone at +1 (650) 832-6006 and then delete or destroy any copy of this email and its attachments. The sender reserves and asserts all rights to confidentiality, as well as any privileges that may apply. Any disclosure, copying, distribution or action taken or omitted to be taken by an unintended recipient in reliance on this message is prohibited and may be unlawful.
Please consider the environment before printing this email.

Re: Keys and partitions

Posted by Ricardo Ferreira <ri...@riferrei.com>.

It is also important to note that since release 2.4 of Apache Kafka
the DefaultPartitioner implements a sticky partitioning strategy,
rather than round-robin, for records that have no key. Records with a
key are still assigned to a partition by hashing the key. This means
that if you need fine control over which partition a record ends up in
given its key, you ought to write your own partitioner class.

More information about this here 
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-480%3A+Sticky+Partitioner>.
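
As a rough sketch of the kind of logic such a custom partitioner could contain, the following spreads known "hot" keys over several partitions and leaves every other key on a single hashed partition. Everything here is an assumption for illustration: the class `SpreadingPartitioner`, its `hot_keys` and `fanout` parameters, and the CRC32 hash are hypothetical, and this is plain Python rather than Kafka code. In a real Java producer the same decision would live in a class implementing `org.apache.kafka.clients.producer.Partitioner`, registered through the `partitioner.class` producer setting.

```python
import zlib
from collections import defaultdict


def hash_partition(key, num_partitions):
    # Stand-in hash; Kafka itself uses murmur2 on the serialized key.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class SpreadingPartitioner:
    """Hypothetical partitioning logic: hot keys fan out, others stay put."""

    def __init__(self, num_partitions, hot_keys, fanout):
        self.num_partitions = num_partitions
        self.hot_keys = set(hot_keys)
        # A key cannot be spread over more partitions than exist.
        self.fanout = min(fanout, num_partitions)
        self._sent = defaultdict(int)  # records seen so far, per hot key

    def partitions_for(self, key):
        # All partitions a key may land in, so consumers know where to look.
        base = hash_partition(key, self.num_partitions)
        width = self.fanout if key in self.hot_keys else 1
        return [(base + i) % self.num_partitions for i in range(width)]

    def partition(self, key):
        candidates = self.partitions_for(key)
        if len(candidates) == 1:
            return candidates[0]
        # Round-robin over the key's candidate partitions.
        index = self._sent[key] % len(candidates)
        self._sent[key] += 1
        return candidates[index]


partitioner = SpreadingPartitioner(num_partitions=12, hot_keys={"huge-a"}, fanout=4)
print(sorted({partitioner.partition("huge-a") for _ in range(40)}))
print(partitioner.partition("user-1"))
```

Note the trade-off: records of a hot key are no longer ordered relative to each other across partitions, so this only fits cases where per-user ordering is not required.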

Thanks,

-- Ricardo

On 7/7/20 9:54 AM, Vinicius Scheidegger wrote:
> Hi Victoria,
>
> If processing order is not a requirement you could define a random key and
> your load would be randomly distributed across partitions.
> So far I have been unable to find a solution to perfectly distribute the load
> across partitions when records are created from multiple producers - random
> distribution might be good enough though.
>
> I hope it helps,
>
> Vinicius Scheidegger

Re: Keys and partitions

Posted by Vinicius Scheidegger <vi...@gmail.com>.

Hi Victoria,

If processing order is not a requirement you could define a random key and
your load would be randomly distributed across partitions.
So far I have been unable to find a solution to perfectly distribute the load
across partitions when records are created from multiple producers - random
distribution might be good enough though.
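
A small sketch of what "good enough" can look like, assuming each record is sent to a uniformly random partition. The function name `random_spread`, the record count, and the partition count are hypothetical, and this simulates a single random source rather than real producers.

```python
import random
from collections import Counter


def random_spread(num_records, num_partitions, seed=0):
    # Assign each record to a uniformly random partition and count the results.
    rng = random.Random(seed)
    return Counter(rng.randrange(num_partitions) for _ in range(num_records))


num_records, num_partitions = 100_000, 10
counts = random_spread(num_records, num_partitions)
mean = num_records / num_partitions
worst = max(abs(c - mean) / mean for c in counts.values())
print(f"largest deviation from the mean: {worst:.1%}")
```

The counts are close to each other but not identical, which matches the point above: random assignment balances load approximately, not perfectly.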

I hope it helps,

Vinicius Scheidegger

