Posted to users@kafka.apache.org by Robert Metzger <rm...@apache.org> on 2015/07/15 18:24:07 UTC

Re: Consumer that consumes only local partition?

Hi Shef,

did you resolve this issue?
I'm facing some performance issues and I was wondering whether reading
locally would resolve them.

On Mon, Jun 22, 2015 at 11:43 PM, Shef <sh...@yahoo.com> wrote:

> Noob question here. I want to have a single consumer for each partition
> that consumes only the messages that have been written locally. In other
> words, I want the consumer to access the local disk and not pull anything
> across the network. Possible?
>
> How can I discover which partitions are local?
>
>
>

Re: Consumer that consumes only local partition?

Posted by Hawin Jiang <ha...@gmail.com>.
Hi Robert,

Here is the Kafka benchmark for your reference.
If you use Flink, Storm, Samza, or Spark on top of Kafka, the performance
will go down from these raw numbers:

821,557 records/sec (78.3 MB/sec)

https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

Best regards
Hawin



On Tue, Aug 4, 2015 at 11:57 AM, Robert Metzger <rm...@apache.org> wrote:

> Sorry for the very late reply ...
>
> The performance issue was not caused by network latency. I had a job like
> this:
> FlinkKafkaConsumer --> someSimpleOperation --> FlinkKafkaProducer.
>
> I thought that our FlinkKafkaConsumer was slow, but actually our
> FlinkKafkaProducer was still using Kafka's old producer API. Switching to
> Kafka's new producer API greatly improved our write performance to
> Kafka; the slow producer had been holding back the KafkaConsumer in Flink.
>
> Since we are already talking about performance, let me ask you the
> following question:
> I am using Kafka and Flink on an HDP 2.2 cluster (with 40 machines). What
> would you consider a good read/write performance for 8-byte messages on the
> following setup?
> - 40 brokers,
> - topic with 120 partitions
> - 120 reading threads (on 30 machines)
> - 120 writing threads (on 30 machines)
>
> I'm getting a write throughput of ~75k elements/core/second and a read
> throughput of ~50k elements/core/second.
> When I stop the writers, the read throughput goes up to ~130k.
> I would expect a higher throughput than (8 * 75,000) / 1024 = 585.9 KB/sec
> per partition ... or are the messages simply too small, so that the
> per-message overhead dominates?
>
> Which system out there would you recommend for getting reference
> performance numbers? Samza, Spark, Storm?
>
>
> On Wed, Jul 15, 2015 at 7:20 PM, Gwen Shapira <gs...@cloudera.com>
> wrote:
>
> > This is not something you can easily do with the consumer API
> > (consumers have no notion of locality).
> > I can imagine using Kafka's low-level API calls to get the list of
> > partitions and their lead replicas, figuring out which are local, and
> > using those - but that sounds painful.
> >
> > Are you 100% sure the performance issue is due to network latency? If
> > not, you may want to start optimizing somewhere more productive :)
> > Kafka brokers and clients both expose metrics that may help you track
> > down where the performance issues are coming from.
> >
> > Gwen
> >
> > On Wed, Jul 15, 2015 at 9:24 AM, Robert Metzger <rm...@apache.org>
> > wrote:
> > > Hi Shef,
> > >
> > > did you resolve this issue?
> > > I'm facing some performance issues and I was wondering whether reading
> > > locally would resolve them.
> > >
> > > On Mon, Jun 22, 2015 at 11:43 PM, Shef <sh...@yahoo.com> wrote:
> > >
> > >> Noob question here. I want to have a single consumer for each
> partition
> > >> that consumes only the messages that have been written locally. In
> other
> > >> words, I want the consumer to access the local disk and not pull
> > anything
> > >> across the network. Possible?
> > >>
> > >> How can I discover which partitions are local?
> > >>
> > >>
> > >>
> >
>

Re: Consumer that consumes only local partition?

Posted by Robert Metzger <rm...@apache.org>.
Sorry for the very late reply ...

The performance issue was not caused by network latency. I had a job like
this:
FlinkKafkaConsumer --> someSimpleOperation --> FlinkKafkaProducer.

I thought that our FlinkKafkaConsumer was slow, but actually our
FlinkKafkaProducer was still using Kafka's old producer API. Switching to
Kafka's new producer API greatly improved our write performance to
Kafka; the slow producer had been holding back the KafkaConsumer in Flink.
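For context, the switch described here is from Kafka's old Scala producer to the new Java producer client shipped with Kafka 0.8.2 (org.apache.kafka.clients.producer.KafkaProducer), which batches sends asynchronously. A minimal configuration sketch for the new producer; the broker hostnames and serializer choices below are illustrative placeholders, not taken from this thread:

```properties
# New producer client (org.apache.kafka.clients.producer.KafkaProducer).
# Hostnames and serializers are placeholders.
bootstrap.servers=broker-01:9092,broker-02:9092
key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
# Asynchronous batching is a big part of the new producer's throughput win:
batch.size=16384
linger.ms=5
acks=1
```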

Since we are already talking about performance, let me ask you the
following question:
I am using Kafka and Flink on an HDP 2.2 cluster (with 40 machines). What
would you consider a good read/write performance for 8-byte messages on the
following setup?
- 40 brokers,
- topic with 120 partitions
- 120 reading threads (on 30 machines)
- 120 writing threads (on 30 machines)

I'm getting a write throughput of ~75k elements/core/second and a read
throughput of ~50k elements/core/second.
When I stop the writers, the read throughput goes up to ~130k.
I would expect a higher throughput than (8 * 75,000) / 1024 = 585.9 KB/sec
per partition ... or are the messages simply too small, so that the
per-message overhead dominates?
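Spelling out the arithmetic in the paragraph above, using only the numbers quoted in this mail (8-byte messages, ~75k writes per thread per second, 120 writer threads):

```python
# Back-of-the-envelope check of the quoted write throughput.
msg_size_bytes = 8            # 8-byte messages
writes_per_thread = 75_000    # ~75k elements/core/second
threads = 120                 # 120 writing threads, one per partition

per_partition_kb = msg_size_bytes * writes_per_thread / 1024
aggregate_mb = msg_size_bytes * writes_per_thread * threads / (1024 * 1024)

print(f"per partition: {per_partition_kb:.1f} KB/sec")  # 585.9
print(f"aggregate:     {aggregate_mb:.1f} MB/sec")      # 68.7
```

Even the ~68.7 MB/sec aggregate is below the ~78.3 MB/sec single-machine figure from the LinkedIn benchmark cited earlier in the thread, which supports the suspicion that per-message overhead, not raw bandwidth, is the limit for 8-byte messages.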

Which system out there would you recommend for getting reference
performance numbers? Samza, Spark, Storm?


On Wed, Jul 15, 2015 at 7:20 PM, Gwen Shapira <gs...@cloudera.com> wrote:

> This is not something you can easily do with the consumer API
> (consumers have no notion of locality).
> I can imagine using Kafka's low-level API calls to get the list of
> partitions and their lead replicas, figuring out which are local, and
> using those - but that sounds painful.
>
> Are you 100% sure the performance issue is due to network latency? If
> not, you may want to start optimizing somewhere more productive :)
> Kafka brokers and clients both expose metrics that may help you track
> down where the performance issues are coming from.
>
> Gwen
>
> On Wed, Jul 15, 2015 at 9:24 AM, Robert Metzger <rm...@apache.org>
> wrote:
> > Hi Shef,
> >
> > did you resolve this issue?
> > I'm facing some performance issues and I was wondering whether reading
> > locally would resolve them.
> >
> > On Mon, Jun 22, 2015 at 11:43 PM, Shef <sh...@yahoo.com> wrote:
> >
> >> Noob question here. I want to have a single consumer for each partition
> >> that consumes only the messages that have been written locally. In other
> >> words, I want the consumer to access the local disk and not pull
> anything
> >> across the network. Possible?
> >>
> >> How can I discover which partitions are local?
> >>
> >>
> >>
>

Re: Consumer that consumes only local partition?

Posted by Gwen Shapira <gs...@cloudera.com>.
This is not something you can easily do with the consumer API
(consumers have no notion of locality).
I can imagine using Kafka's low-level API calls to get the list of
partitions and their lead replicas, figuring out which are local, and
using those - but that sounds painful.
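To make the painful part concrete: the leader information would come from a Kafka metadata request (e.g. TopicMetadataRequest in the 0.8.x low-level API); the filtering step itself is simple once you have it. A minimal Python sketch, with an entirely made-up leader map standing in for the metadata response:

```python
import socket

# Hypothetical result of a metadata request: partition id -> leader host.
# In a real client this would come from TopicMetadataRequest; the broker
# hostnames here are made up for illustration.
partition_leaders = {
    0: "broker-01.example.com",
    1: "broker-02.example.com",
    2: "broker-01.example.com",
}

def local_partitions(leaders, hostname=None):
    """Return the partitions whose lead replica runs on this host."""
    hostname = hostname or socket.getfqdn()
    return sorted(p for p, host in leaders.items() if host == hostname)

print(local_partitions(partition_leaders, hostname="broker-01.example.com"))
# [0, 2]
```

Note that leadership moves when brokers fail or rebalance, so a consumer pinned to "local" partitions this way would also have to re-check the metadata and reassign itself on leader changes.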

Are you 100% sure the performance issue is due to network latency? If
not, you may want to start optimizing somewhere more productive :)
Kafka brokers and clients both expose metrics that may help you track
down where the performance issues are coming from.

Gwen

On Wed, Jul 15, 2015 at 9:24 AM, Robert Metzger <rm...@apache.org> wrote:
> Hi Shef,
>
> did you resolve this issue?
> I'm facing some performance issues and I was wondering whether reading
> locally would resolve them.
>
> On Mon, Jun 22, 2015 at 11:43 PM, Shef <sh...@yahoo.com> wrote:
>
>> Noob question here. I want to have a single consumer for each partition
>> that consumes only the messages that have been written locally. In other
>> words, I want the consumer to access the local disk and not pull anything
>> across the network. Possible?
>>
>> How can I discover which partitions are local?
>>
>>
>>