Posted to users@kafka.apache.org by Jon Yeargers <jo...@cedexis.com> on 2017/01/24 16:18:30 UTC

Strategy for true random producer keying

If I don't specify a key when I send a value to Kafka (something akin
to 'kafkaProducer.send(new ProducerRecord<>(TOPIC_PRODUCE, jsonView))') how
is it keyed?

I am producing to a topic from an external feed. It appears to be heavily
biased towards certain values and as a result I have 2-3 partitions that
are lagging heavily while the rest are staying current. Since I don't use
the keys in my consumers, I'm wondering if I could randomize these values
somehow to better distribute the load.

Re: Strategy for true random producer keying

Posted by Jon Yeargers <jo...@cedexis.com>.
(cont'd) Meant to say System.currentTimeMillis() modulo the partition count.

Having said that - is there any disadvantage to true random distribution of
traffic for a topic?

On Tue, Jan 24, 2017 at 11:17 AM, Jon Yeargers <jo...@cedexis.com>
wrote:

> It may be picking a random partition but it sticks with it indefinitely
> despite there being a significant disparity in traffic. I need to break it
> up in some different fashion. Maybe just a hash of
> System.currentTimeMillis()?
>
>
>
> On Tue, Jan 24, 2017 at 10:52 AM, Avi Flax <av...@parkassist.com>
> wrote:
>
>>
>> > On Jan 24, 2017, at 11:18, Jon Yeargers <jo...@cedexis.com> wrote:
>> >
>> > If I don't specify a key when I send a value to Kafka (something akin
>> > to 'kafkaProducer.send(new ProducerRecord<>(TOPIC_PRODUCE, jsonView))') how
>> > is it keyed?
>>
>> IIRC, in this case the key is null; i.e. there is no key.
>>
>> > I am producing to a topic from an external feed. It appears to be heavily
>> > biased towards certain values and as a result I have 2-3 partitions that
>> > are lagging heavily where the rest are staying current.
>>
>> Hmm, according to the docs this shouldn’t matter:
>>
>> > If the key is null, then a random broker partition is picked.
>>
>> https://kafka.apache.org/documentation/#impl_producer
>>
>> You might want to double-check your code and confirm that it is indeed
>> sending no keys… i.e. maybe it’s actually using an empty string as a key,
>> or something like that.
>>
>> > Since I don't use
>> > the keys in my consumers I'm wondering if I could randomize these values
>> > somehow to better distribute the load.
>>
>> As per the above docs, this _should_ already be the case, based on what
>> you’ve described.
>>
>> That said, if you continue to have trouble, then you can introduce your
>> own implementation of kafka.producer.Partitioner, and again as per the docs:
>>
>> > A custom partitioning strategy can also be plugged in using the
>> > partitioner.class config parameter.
>>
>> Also, it so happens that I have implemented a custom random partitioning
>> strategy through an alternate approach by using the overloaded
>> ProducerRecord constructor that accepts a partition ID. You can easily get
>> the set of partition IDs from the Producer with the partitionsFor method.
>>
>> HTH!
>> Avi
>>
>> ————
>> Software Architect @ Park Assist » http://tech.parkassist.com/
>
>
>

Re: Strategy for true random producer keying

Posted by Avi Flax <av...@parkassist.com>.
> On Jan 24, 2017, at 14:17, Jon Yeargers <jo...@cedexis.com> wrote:
> 
> It may be picking a random partition but it sticks with it indefinitely
> despite there being a significant disparity in traffic.

Ah, I forgot to mention that, IIRC, the default Partitioner impl doesn’t choose a random partition for each individual record; it chooses one randomly every ~10 minutes, and for that period sends all records to that partition.

Sorry I don’t have a citation for this... I think it’s been mentioned before on this list somewhere. (And of course it’s in the source code.)

> I need to break it
> up in some different fashion. Maybe just a hash of
> System.currentTimeMillis()?

You could probably just use the result of currentTimeMillis() as the key. However, I don’t recommend using a synthetic key, because down the road other folks may end up thinking it has semantic value. Rather, I recommend you either write a custom implementation of Partitioner, or simply assign a random partition ID to each ProducerRecord as I described earlier.
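
A minimal sketch of the per-record random partition idea, in plain Java so it runs without a broker. The class RandomPartitionPicker and its pick method are hypothetical names for illustration, not part of Kafka, and the hard-coded ID list is an assumption standing in for the IDs you would collect from the producer's partitionsFor method.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not part of the Kafka API.
final class RandomPartitionPicker {

    // Returns one partition ID chosen uniformly at random from the given IDs.
    static int pick(List<Integer> partitionIds) {
        if (partitionIds.isEmpty()) {
            throw new IllegalArgumentException("topic has no known partitions");
        }
        int index = ThreadLocalRandom.current().nextInt(partitionIds.size());
        return partitionIds.get(index);
    }

    public static void main(String[] args) {
        // Assumption: a topic with 6 partitions whose IDs are 0..5.
        List<Integer> partitionIds = Arrays.asList(0, 1, 2, 3, 4, 5);

        // Count how many of 60,000 simulated records land on each partition.
        // The IDs double as array indices here only because they are 0..5.
        int[] counts = new int[partitionIds.size()];
        for (int i = 0; i < 60_000; i++) {
            counts[pick(partitionIds)]++;
        }
        for (int p = 0; p < counts.length; p++) {
            System.out.println("partition " + p + ": " + counts[p] + " records");
        }
    }
}
```

With a real producer, the chosen ID would go into the ProducerRecord constructor that accepts a partition, along the lines of new ProducerRecord<>(TOPIC_PRODUCE, partition, null, jsonView), so the key stays null and nobody downstream mistakes it for meaningful data.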

> meant to say mod%partition count of System.currentTimeMillis()


Well, hashing the key and taking the result modulo the partition count is essentially what the default partitioner does when a record has a key. So no need to re-implement that; as I wrote above you could just use the current time as the key, and that should yield the same behavior.
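
As a rough illustration of "hash of the key, modulo the partition count", here is a plain-Java sketch. KeyedPartitionSketch and partitionFor are hypothetical names, and Arrays.hashCode is only a stand-in: Kafka's Java client applies its own hash (murmur2) to the serialized key bytes, so the actual partition numbers will differ from what this prints.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of keyed partitioning; not Kafka's actual implementation.
final class KeyedPartitionSketch {

    // Maps a key to a partition index in the range [0, partitionCount).
    static int partitionFor(String key, int partitionCount) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        // Stand-in hash (assumption); Kafka hashes the serialized key differently.
        int hash = Arrays.hashCode(keyBytes);
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(hash, partitionCount);
    }

    public static void main(String[] args) {
        int partitionCount = 6; // assumed partition count

        // The same key always maps to the same partition.
        System.out.println(partitionFor("customer-42", partitionCount));
        System.out.println(partitionFor("customer-42", partitionCount));

        // A timestamp used as a key changes from one millisecond to the next,
        // so consecutive records usually get different keys.
        String timeKey = Long.toString(System.currentTimeMillis());
        System.out.println(partitionFor(timeKey, partitionCount));
    }
}
```

One thing the sketch makes visible: records sent within the same millisecond share a timestamp key and therefore a partition, so a time-based key spreads load over time rather than strictly per record.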

> is there any disadvantage to true random distribution of traffic for a topic?


Yes: you lose ordering. This may or may not matter for your application. It ended up being a major problem for my application, so I switched to an entirely different topic/partition scheme in order to achieve my particular goals (isolating each customer’s data + evenly parallelizing I/O-limited processing while retaining a certain required ordering).
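
To show why ordering goes away: Kafka only guarantees order within a single partition, so records that are scattered across partitions can be consumed in any relative order. The plain-Java sketch below just counts how many partitions one stream of records touches under each scheme. OrderingSketch, its method names, and the use of String.hashCode as the key hash are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical simulation; it does not talk to Kafka.
final class OrderingSketch {

    // Partitions touched when every record gets an independent random partition.
    static Set<Integer> partitionsUsedRandom(int recordCount, int partitionCount, long seed) {
        Random random = new Random(seed);
        Set<Integer> used = new HashSet<>();
        for (int i = 0; i < recordCount; i++) {
            used.add(random.nextInt(partitionCount));
        }
        return used;
    }

    // Partitions touched when every record carries the same key.
    static Set<Integer> partitionsUsedKeyed(String key, int recordCount, int partitionCount) {
        Set<Integer> used = new HashSet<>();
        for (int i = 0; i < recordCount; i++) {
            // Stand-in hash (assumption); the point is only that it is stable per key.
            used.add(Math.floorMod(key.hashCode(), partitionCount));
        }
        return used;
    }

    public static void main(String[] args) {
        int records = 1_000;
        int partitions = 6;
        System.out.println("random: " + partitionsUsedRandom(records, partitions, 1L).size()
                + " partition(s) touched");
        System.out.println("keyed:  " + partitionsUsedKeyed("customer-42", records, partitions).size()
                + " partition(s) touched");
    }
}
```

A keyed stream stays on one partition and so keeps its order; the random stream does not, which is the trade-off to weigh if any consumer cares about sequence.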

HTH!

————
Software Architect @ Park Assist » http://tech.parkassist.com/

Re: Strategy for true random producer keying

Posted by Jon Yeargers <jo...@cedexis.com>.
It may be picking a random partition but it sticks with it indefinitely
despite there being a significant disparity in traffic. I need to break it
up in some different fashion. Maybe just a hash of
System.currentTimeMillis()?



On Tue, Jan 24, 2017 at 10:52 AM, Avi Flax <av...@parkassist.com> wrote:

>
> > On Jan 24, 2017, at 11:18, Jon Yeargers <jo...@cedexis.com> wrote:
> >
> > If I don't specify a key when I send a value to Kafka (something akin
> > to 'kafkaProducer.send(new ProducerRecord<>(TOPIC_PRODUCE, jsonView))') how
> > is it keyed?
>
> IIRC, in this case the key is null; i.e. there is no key.
>
> > I am producing to a topic from an external feed. It appears to be heavily
> > biased towards certain values and as a result I have 2-3 partitions that
> > are lagging heavily where the rest are staying current.
>
> Hmm, according to the docs this shouldn’t matter:
>
> > If the key is null, then a random broker partition is picked.
>
> https://kafka.apache.org/documentation/#impl_producer
>
> You might want to double-check your code and confirm that it is indeed
> sending no keys… i.e. maybe it’s actually using an empty string as a key,
> or something like that.
>
> > Since I don't use
> > the keys in my consumers I'm wondering if I could randomize these values
> > somehow to better distribute the load.
>
> As per the above docs, this _should_ already be the case, based on what
> you’ve described.
>
> That said, if you continue to have trouble, then you can introduce your
> own implementation of kafka.producer.Partitioner, and again as per the docs:
>
> > A custom partitioning strategy can also be plugged in using the
> > partitioner.class config parameter.
>
> Also, it so happens that I have implemented a custom random partitioning
> strategy through an alternate approach by using the overloaded
> ProducerRecord constructor that accepts a partition ID. You can easily get
> the set of partition IDs from the Producer with the partitionsFor method.
>
> HTH!
> Avi
>
> ————
> Software Architect @ Park Assist » http://tech.parkassist.com/

Re: Strategy for true random producer keying

Posted by Avi Flax <av...@parkassist.com>.
> On Jan 24, 2017, at 11:18, Jon Yeargers <jo...@cedexis.com> wrote:
> 
> If I don't specify a key when I send a value to Kafka (something akin
> to 'kafkaProducer.send(new ProducerRecord<>(TOPIC_PRODUCE, jsonView))') how
> is it keyed?

IIRC, in this case the key is null; i.e. there is no key.

> I am producing to a topic from an external feed. It appears to be heavily
> biased towards certain values and as a result I have 2-3 partitions that
> are lagging heavily where the rest are staying current.

Hmm, according to the docs this shouldn’t matter:

> If the key is null, then a random broker partition is picked.

https://kafka.apache.org/documentation/#impl_producer

You might want to double-check your code and confirm that it is indeed sending no keys… i.e. maybe it’s actually using an empty string as a key, or something like that.

> Since I don't use
> the keys in my consumers I'm wondering if I could randomize these values
> somehow to better distribute the load.

As per the above docs, this _should_ already be the case, based on what you’ve described.

That said, if you continue to have trouble, then you can introduce your own implementation of kafka.producer.Partitioner, and again as per the docs:

> A custom partitioning strategy can also be plugged in using the partitioner.class config parameter.

Also, it so happens that I have implemented a custom random partitioning strategy through an alternate approach by using the overloaded ProducerRecord constructor that accepts a partition ID. You can easily get the set of partition IDs from the Producer with the partitionsFor method.

HTH!
Avi

————
Software Architect @ Park Assist » http://tech.parkassist.com/