Posted to users@kafka.apache.org by Bhavesh Mistry <mi...@gmail.com> on 2014/08/05 03:12:23 UTC

Uniform Distribution of Messages for Topic Across Partitions Without Affecting Performance

How can we achieve a uniform distribution of non-keyed messages per topic
across all partitions?

We tried to get this uniform distribution with a custom partitioner in each
producer instance that uses round robin (count(messages) % number of
partitions for the topic). This strategy results in very poor performance,
so we switched back to the random stickiness that Kafka provides out of the
box, where a producer stays on one partition per topic for some interval
(10 minutes, not sure exactly).
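
For reference, the round-robin rule described above (message count modulo
partition count) can be sketched as follows. This is a minimal Python
sketch, not Kafka's partitioner API; the class name RoundRobinPartitioner
and its partition() method are made up for illustration.

```python
from collections import Counter
from itertools import count


class RoundRobinPartitioner:
    """Hypothetical helper: message i goes to partition i % num_partitions."""

    def __init__(self, num_partitions):
        if num_partitions <= 0:
            raise ValueError("num_partitions must be positive")
        self.num_partitions = num_partitions
        self._counter = count()  # number of messages this producer has sent

    def partition(self):
        # count(messages) % number of partitions for the topic
        return next(self._counter) % self.num_partitions


partitioner = RoundRobinPartitioner(num_partitions=4)
assignments = [partitioner.partition() for _ in range(10)]
per_partition = Counter(assignments)
print(assignments)  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
```

One plausible reason for the poor performance, which the thread does not
confirm, is that consecutive messages are spread over every partition, so
each producer talks to every broker and builds smaller batches per
partition.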

The random-stickiness strategy sometimes results in consumer-side lag for
some partitions, because some of our applications/producers produce more
messages for the same topic than other servers do.
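
The skew can be illustrated with a little arithmetic. The sketch below adds
up the load each partition receives while every producer is stuck to one
partition. The application names, rates and partition choices are
hypothetical numbers picked for illustration, not measurements from this
thread.

```python
from collections import defaultdict


def load_per_partition(rates, sticky_partition, num_partitions):
    """Sum each producer's message rate onto the partition it is stuck to."""
    load = defaultdict(int)
    for producer, rate in rates.items():
        load[sticky_partition[producer]] += rate
    return [load[p] for p in range(num_partitions)]


# Hypothetical: one noisy application and three quiet ones (messages/min).
rates = {"app-a": 9000, "app-b": 300, "app-c": 300, "app-d": 400}
# Hypothetical sticky choices during one interval.
sticky_partition = {"app-a": 2, "app-b": 0, "app-c": 0, "app-d": 3}

load = load_per_partition(rates, sticky_partition, num_partitions=4)
print(load)  # [600, 0, 9000, 400]
```

An even split would put 2,500 messages per minute on each partition. Here
partition 2 carries 9,000 while partition 1 is idle, so the consumer of
partition 2 is the one that falls behind.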

Can Kafka provide uniform distribution out of the box by coordinating among
all producers, relying on a measured rate such as the number of messages per
minute or the number of bytes produced per minute, and coordinating
partition stickiness among hundreds of producers for the same topic?
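
To make the request concrete, here is a rough sketch of what rate-aware
stickiness could look like. It is not an existing Kafka feature as far as
this thread shows. It assumes the per-producer rates are already known in
one place, and the function names pick_least_loaded and assign_producers
are hypothetical.

```python
def pick_least_loaded(load):
    """Return the partition id with the lowest load.

    Ties go to the lowest partition id so the choice is deterministic.
    """
    if not load:
        raise ValueError("no partitions to choose from")
    return min(sorted(load), key=lambda p: load[p])


def assign_producers(producer_rates, num_partitions):
    """Greedy sketch: heaviest producer first, each onto the least-loaded partition."""
    load = {p: 0 for p in range(num_partitions)}
    assignment = {}
    ordered = sorted(producer_rates.items(), key=lambda kv: (-kv[1], kv[0]))
    for producer, rate in ordered:
        target = pick_least_loaded(load)
        assignment[producer] = target
        load[target] += rate
    return assignment, load


# Hypothetical bytes-per-minute figures for five producers.
producer_rates = {"a": 500, "b": 400, "c": 300, "d": 200, "e": 100}
assignment, load = assign_producers(producer_rates, num_partitions=2)
print(load)  # {0: 800, 1: 700}
```

Note the limit of this approach: if a single producer is much heavier than
all the others, keeping it on one partition still overloads that partition,
so such a producer would have to spread its own messages as well.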

Thanks,

Bhavesh

Re: Uniform Distribution of Messages for Topic Across Partitions Without Affecting Performance

Posted by Jun Rao <ju...@gmail.com>.
In the new producer, a client can specify the partition number for each
message. Then, any partitioning strategy can be implemented by the client.
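
As a language-neutral illustration of that idea, the sketch below tags each
message with a partition chosen by a pluggable client-side strategy. The
Record class, build_records and the two strategy functions are hypothetical
stand-ins written for this example; in the Java client the corresponding
piece is, to my knowledge, the ProducerRecord constructor that accepts a
partition number.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a producer record that carries its partition."""
    topic: str
    partition: int
    value: bytes


def build_records(topic: str, values: List[bytes], num_partitions: int,
                  choose: Callable[[int, int], int]) -> List[Record]:
    """Tag each value with the partition returned by choose(index, num_partitions)."""
    records = []
    for index, value in enumerate(values):
        partition = choose(index, num_partitions)
        if not 0 <= partition < num_partitions:
            raise ValueError(f"strategy returned invalid partition {partition}")
        records.append(Record(topic, partition, value))
    return records


def round_robin(index: int, num_partitions: int) -> int:
    return index % num_partitions


rng = random.Random(42)  # seeded only to make the sketch repeatable


def uniform_random(index: int, num_partitions: int) -> int:
    return rng.randrange(num_partitions)


lines = [f"log line {i}".encode() for i in range(6)]
rr_records = build_records("app-logs", lines, 3, round_robin)
rand_records = build_records("app-logs", lines, 3, uniform_random)
print([r.partition for r in rr_records])  # [0, 1, 2, 0, 1, 2]
```

Because the strategy is an ordinary function, swapping round robin for
random, sticky or rate-aware selection does not change the sending code.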

Thanks,

Jun


On Thu, Aug 7, 2014 at 1:37 PM, Bhavesh Mistry <mi...@gmail.com>
wrote:

> The root of the problem is consumer lag on one or two partitions, even
> with a no-op consumer (read the log and discard it). Our use case is very
> simple: send all the log lines to the brokers. But under a storm of data
> (due to an exception, an application error, etc.), one or two partitions
> lag behind while the other consumers are at 0 lag. We have tuned the GC
> using the recommended GC settings (see the tuning section of
> http://www.slideshare.net/ToddPalino/enterprise-kafka-kafka-as-a-service ).
> In normal situations, this is OK.
>
> Hashing based on a key, and sticking to Murmur hash(key) % number of
> partitions, did not give better throughput compared to random
> partitioning. It would be good to build intelligence into the producer's
> partition selection based on the rate of data for the topic and/or lag.
> Is there any way to customize the stickiness interval for the random
> partitioning strategy?
>
> Sorry for the late response.
>
> Thanks,
>
> Bhavesh

Re: Uniform Distribution of Messages for Topic Across Partitions Without Affecting Performance

Posted by Bhavesh Mistry <mi...@gmail.com>.
The root of the problem is consumer lag on one or two partitions, even with
a no-op consumer (read the log and discard it). Our use case is very simple:
send all the log lines to the brokers. But under a storm of data (due to an
exception, an application error, etc.), one or two partitions lag behind
while the other consumers are at 0 lag. We have tuned the GC using the
recommended GC settings (see the tuning section of
http://www.slideshare.net/ToddPalino/enterprise-kafka-kafka-as-a-service ).
In normal situations, this is OK.

Hashing based on a key, and sticking to Murmur hash(key) % number of
partitions, did not give better throughput compared to random
partitioning. It would be good to build intelligence into the producer's
partition selection based on the rate of data for the topic and/or lag.
Is there any way to customize the stickiness interval for the random
partitioning strategy?
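
A small sketch of hash(key) % number of partitions shows why the choice of
key matters for balance. Python's standard library has no MurmurHash, so
zlib.crc32 is swapped in here purely as a stand-in hash. The host names and
message counts are hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key: bytes, num_partitions: int) -> int:
    """hash(key) % partitions, with CRC32 standing in for MurmurHash."""
    return zlib.crc32(key) % num_partitions


def spread(keys, num_partitions):
    """Count how many messages land on each partition for the given keys."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts[p] for p in range(num_partitions)]


num_partitions = 8

# Keyed by host name: one noisy host sends 9,000 of 10,000 log lines.
host_keys = [b"host-1"] * 9000 + [b"host-2"] * 500 + [b"host-3"] * 500
by_host = spread(host_keys, num_partitions)

# Keyed by a per-message sequence number: many distinct keys.
seq_keys = [str(i).encode() for i in range(10000)]
by_sequence = spread(seq_keys, num_partitions)

print(max(by_host), max(by_sequence))
```

With few distinct keys, every message from the noisy host lands on the same
partition no matter which hash function is used. With many distinct keys the
counts spread out, which suggests that the hash function itself is unlikely
to be the deciding factor.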

Sorry for the late response.

Thanks,

Bhavesh


On Mon, Aug 4, 2014 at 6:50 PM, Joe Stein <jo...@stealth.ly> wrote:

> Bhavesh, take a look at
>
> https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyisdatanotevenlydistributedamongpartitionswhenapartitioningkeyisnotspecified
> ?
>
> Maybe the root cause is something else? Even if some producers produce
> more or less than others, you should be able to make the distribution
> random enough with a partitioner and a key. I don't think you should need
> more than what is in the FAQ, but in case you do, maybe look into
> http://en.wikipedia.org/wiki/MurmurHash as another hash option.
>
> /*******************************************
>  Joe Stein
>  Founder, Principal Consultant
>  Big Data Open Source Security LLC
>  http://www.stealth.ly
>  Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
> ********************************************/

Re: Uniform Distribution of Messages for Topic Across Partitions Without Affecting Performance

Posted by Joe Stein <jo...@stealth.ly>.
Bhavesh, take a look at
https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyisdatanotevenlydistributedamongpartitionswhenapartitioningkeyisnotspecified
?

Maybe the root cause is something else? Even if some producers produce
more or less than others, you should be able to make the distribution
random enough with a partitioner and a key. I don't think you should need
more than what is in the FAQ, but in case you do, maybe look into
http://en.wikipedia.org/wiki/MurmurHash as another hash option.
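
The difference between sticking to one random partition and choosing a
partition per message can be seen in a short simulation. This is a sketch
under stated assumptions: the sticky interval is counted in messages rather
than minutes, the simulate function is hypothetical, and it models a single
producer. If I recall the linked FAQ correctly, the sticky interval of the
producer discussed here follows its metadata refresh interval
(topic.metadata.refresh.interval.ms), but please check the FAQ rather than
rely on this note.

```python
import random
from collections import Counter


def simulate(num_messages, num_partitions, sticky_for, seed=0):
    """Pick a random partition, keep it for sticky_for messages, then re-pick."""
    if sticky_for <= 0:
        raise ValueError("sticky_for must be positive")
    rng = random.Random(seed)  # seeded so the run is repeatable
    counts = Counter()
    current = None
    for i in range(num_messages):
        if i % sticky_for == 0:  # interval expired: choose again
            current = rng.randrange(num_partitions)
        counts[current] += 1
    return [counts[p] for p in range(num_partitions)]


# One sticky choice for the whole window versus a fresh choice per message.
sticky = simulate(num_messages=10_000, num_partitions=8, sticky_for=10_000)
per_message = simulate(num_messages=10_000, num_partitions=8, sticky_for=1)

print(sticky)
print(per_message)
```

In the sticky run a single partition receives all 10,000 messages, while in
the per-message run every partition receives a share close to 1,250. A
random key combined with a hashing partitioner behaves like the second case.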

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/


On Mon, Aug 4, 2014 at 9:12 PM, Bhavesh Mistry <mi...@gmail.com>
wrote:

> How can we achieve a uniform distribution of non-keyed messages per topic
> across all partitions?
>
> We tried to get this uniform distribution with a custom partitioner in each
> producer instance that uses round robin (count(messages) % number of
> partitions for the topic). This strategy results in very poor performance,
> so we switched back to the random stickiness that Kafka provides out of the
> box, where a producer stays on one partition per topic for some interval
> (10 minutes, not sure exactly).
>
> The random-stickiness strategy sometimes results in consumer-side lag for
> some partitions, because some of our applications/producers produce more
> messages for the same topic than other servers do.
>
> Can Kafka provide uniform distribution out of the box by coordinating among
> all producers, relying on a measured rate such as the number of messages per
> minute or the number of bytes produced per minute, and coordinating
> partition stickiness among hundreds of producers for the same topic?
>
> Thanks,
>
> Bhavesh
>
