Posted to dev@kafka.apache.org by Jay Kreps <ja...@gmail.com> on 2014/02/08 22:22:37 UTC

Inconsistent latency with replication

Hey guys,

I was running the end-to-end latency test (kafka.TestEndToEndLatency) and
saw something a little weird. This test runs a producer and a consumer,
sends a single message at a time, and measures the round-trip time from
the producer's send to the consumer receiving the message.
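
(The core of such a measurement loop is roughly the sketch below. It is
written against the Java clients purely for illustration; the actual
kafka.TestEndToEndLatency uses the old Scala producer and consumer, and the
broker address, topic name, acks setting, and message size here are made up.)

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EndToEndLatencySketch {
    public static void main(String[] args) {
        Properties pprops = new Properties();
        pprops.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        pprops.put("acks", "1");                             // assumed; the ack setting matters here
        pprops.put("key.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        pprops.put("value.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pprops);

        Properties cprops = new Properties();
        cprops.put("bootstrap.servers", "localhost:9092");
        cprops.put("group.id", "latency-test");              // hypothetical group id
        cprops.put("key.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        cprops.put("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cprops);
        consumer.subscribe(Collections.singletonList("latency-test"));  // hypothetical topic

        // Wait for a partition assignment, then start reading from the end of the log.
        while (consumer.assignment().isEmpty()) {
            consumer.poll(Duration.ofMillis(100));
        }
        consumer.seekToEnd(consumer.assignment());

        for (int i = 1; i <= 10000; i++) {
            long start = System.nanoTime();
            // send one message, then spin on poll until it comes back
            producer.send(new ProducerRecord<>("latency-test", new byte[100]));
            ConsumerRecords<byte[], byte[]> records;
            do {
                records = consumer.poll(Duration.ofMillis(100));
            } while (records.isEmpty());
            double elapsedMs = (System.nanoTime() - start) / 1_000_000.0;
            if (i % 1000 == 0) System.out.println(i + "  " + elapsedMs + " ms");
        }
        producer.close();
        consumer.close();
    }
}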

With replication-factor=1 I see very consistent performance, with
end-to-end latency at 0.4-0.5 ms, which is extremely good.

But with replication factor=2 I see something like this:

count   latency
1000    1.9 ms
2000    1.8 ms
3000    1.4 ms
4000    1.7 ms
5000    102.6 ms
6000    101.4 ms
7000    102.4 ms
8000    1.6 ms
9000    101.5 ms

This pattern is very reproducible: essentially every 4-5k messages, things
slow down to an average round trip of 100ms and then pick back up again.

Note that this test is not using the new producer.

Have we seen this before? The issue could be in the producer
acknowledgement, the high watermark advancement, or the fetch request, but
I notice that the default fetch max wait is 100ms, which makes me think
there is a bug in the async request handling that causes it to wait until
the timeout. Any ideas? If not I'll file a bug...
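
(For reference, the broker-side settings that govern how long a follower
fetch can block are, as far as I can tell, the two below; the values shown
are illustrative overrides, not the defaults.)

# max time the leader will park a follower fetch (illustrative value)
replica.fetch.wait.max.ms=100
# bytes that must be available before a parked follower fetch returns early
replica.fetch.min.bytes=1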

-Jay

Re: Inconsistent latency with replication

Posted by Jun Rao <ju...@gmail.com>.
This is probably caused by the general issue in purgatory. Basically, the
check on whether a request is satisfied and the watcher registration are
not atomic. So, in this case, when a replica fetch request comes in, it
could be that the byte check is not satisfied, but before the fetch request
is put into purgatory, a produce request sneaks in. When there is only a
single producer client, that replica fetch request then has to wait for the
full timeout. When handling produce requests, we address this issue by
re-checking whether the request is satisfied after watcher registration. We
haven't done that for the fetch request yet.
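
(Roughly, the ordering problem is the one sketched below; this is an
illustrative model, not the actual purgatory code, and the class and field
names are invented.)

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative model of the check/register race, not the real purgatory code.
class DelayedFetchSketch {
    static class DelayedFetch {
        final long fetchOffset;              // follower has data up to this offset
        volatile boolean completed = false;
        DelayedFetch(long fetchOffset) { this.fetchOffset = fetchOffset; }
    }

    private final AtomicLong logEndOffset = new AtomicLong(0);
    private final Queue<DelayedFetch> watchers = new ConcurrentLinkedQueue<>();

    // Handling a replica fetch request that found no new data yet.
    void maybeDelayFetch(DelayedFetch fetch) {
        // 1) initial check: not satisfied, so we decide to park the request
        if (logEndOffset.get() > fetch.fetchOffset) { complete(fetch); return; }

        // <-- a produce request can append *here*, between the check and the
        //     registration below; its notification loop finds no watcher yet

        // 2) register the watcher in purgatory
        watchers.add(fetch);

        // 3) the re-check done on the produce path and still missing on the
        //    fetch path: without it, when no further produce arrives, the
        //    fetch waits out the full timeout
        if (logEndOffset.get() > fetch.fetchOffset) complete(fetch);
    }

    // Handling a produce request that appended new data.
    void onAppend(long newLogEndOffset) {
        logEndOffset.set(newLogEndOffset);
        for (DelayedFetch f : watchers) {
            if (newLogEndOffset > f.fetchOffset) complete(f);
        }
    }

    private void complete(DelayedFetch fetch) {
        if (!fetch.completed) {
            fetch.completed = true;
            watchers.remove(fetch);
            // ...send the fetch response back to the follower...
        }
    }
}

Either making the check and the registration atomic, or re-checking after
registration as the produce path already does, closes the window.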

Thanks,

Jun


Re: Inconsistent latency with replication

Posted by Jay Kreps <ja...@gmail.com>.
Also set to the default, which is 1.

-Jay


On Sat, Feb 8, 2014 at 1:31 PM, Sriram Subramanian <
srsubramanian@linkedin.com> wrote:

> What is replica.fetch.min.bytes set to?

Re: Inconsistent latency with replication

Posted by Sriram Subramanian <sr...@linkedin.com>.
What is replica.fetch.min.bytes set to?
