Posted to users@kafka.apache.org by Evan Huus <ev...@shopify.com> on 2015/04/09 22:36:37 UTC

Fetch Request Purgatory and Mirrormaker

Hey Folks, we're running into an odd issue with mirrormaker and the fetch
request purgatory on the brokers. Our setup consists of two six-node
clusters (all running 0.8.2.1 on identical hardware with the same config).
All "normal" producing and consuming happens on cluster A. Mirrormaker has
been set up to copy all topics (except a tiny blacklist) from cluster A to
cluster B.
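
For anyone unfamiliar with the setup, the mirror is just the stock
kafka.tools.MirrorMaker tool; a rough sketch of the kind of invocation
involved is below (file names, stream count and blacklist entries are
placeholders, not the exact command line):

  # Simplified MirrorMaker invocation (placeholders only; run
  # bin/kafka-run-class.sh kafka.tools.MirrorMaker --help for the exact
  # flags in your build).
  # mirror-consumer.properties points at cluster A's ZooKeeper,
  # mirror-producer.properties at cluster B's brokers.
  bin/kafka-run-class.sh kafka.tools.MirrorMaker \
      --consumer.config mirror-consumer.properties \
      --producer.config mirror-producer.properties \
      --num.streams 4 \
      --blacklist 'some_internal_topic|another_internal_topic'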

Cluster A is completely healthy at the moment. Cluster B is not, which is
very odd since it is literally handling the exact same traffic.

The graph for Fetch Request Purgatory Size looks like this:
https://www.dropbox.com/s/k87wyhzo40h8gnk/Screenshot%202015-04-09%2016.08.37.png?dl=0

Every time the purgatory shrinks, the resulting latency causes one or more
nodes to drop their leadership (they recover quickly). We could probably
alleviate the symptoms by decreasing
`fetch.purgatory.purge.interval.requests` (it is currently at the default
value), but I'd rather try to understand and solve the root cause here.
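
For anyone following along, these are the broker-side settings in question,
shown at what I believe is the 0.8.2 default (double-check against the
server.properties on your own brokers):

  # server.properties -- purgatory purge intervals, believed 0.8.2 defaults.
  # Lowering them makes the broker purge satisfied requests from purgatory
  # more often, at the cost of doing that cleanup work more frequently.
  fetch.purgatory.purge.interval.requests=1000
  producer.purgatory.purge.interval.requests=1000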

Cluster B is handling no outside fetch requests, and turning mirrormaker
off "fixes" the problem, so clearly (since mirrormaker is producing to this
cluster, not consuming from it) the fetch requests must be coming from
internal replication. However, the same data is being replicated when it is
originally produced in cluster A, and the fetch purgatory size sits stably
at ~10k there. There is nothing unusual in the logs on either cluster.
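
For what it's worth, the fetch requests sitting in this purgatory on a
leader are the followers' long-poll fetches: each one is parked until at
least `replica.fetch.min.bytes` of new data is available or
`replica.fetch.wait.max.ms` elapses. The relevant broker settings, at what
I believe are the 0.8.2 defaults, are:

  # server.properties -- replica fetcher settings that control how long a
  # follower fetch waits in the fetch purgatory (believed 0.8.2 defaults;
  # verify against the official config docs).
  replica.fetch.wait.max.ms=500
  replica.fetch.min.bytes=1
  num.replica.fetchers=1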

I have read all the wiki pages and JIRA tickets I can find about the new
purgatory design in 0.8.2, but nothing stands out as applicable. I'm happy
to provide more detailed logs, configuration, etc. if anyone thinks there
might be something important in there. I am completely baffled as to what
could be causing this.

Any suggestions would be appreciated. I'm starting to think at this point
that we've completely misunderstood or misconfigured *something*.

Thanks,
Evan

Re: Fetch Request Purgatory and Mirrormaker

Posted by Evan Huus <ev...@shopify.com>.
This is still occurring for us. In addition, it has started occurring on
one of the six nodes in the "healthy" cluster, for no reason we have been
able to determine.

We're willing to put in some serious time to help debug/solve this, but we
need *some* hint as to where to start. I understand that purgatory has been
rewritten (again) in 0.8.3, so might it be worth trying a trunk build? Is
there an ETA for a beta release of 0.8.3?
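
If anyone else wants to try the same experiment, building a broker tarball
from trunk should be roughly the following (from memory; the README in the
source tree is the authority on the exact steps):

  # Rough sketch of a trunk build -- see the project README for the
  # authoritative steps.
  git clone https://github.com/apache/kafka.git
  cd kafka
  gradle                  # bootstrap the Gradle wrapper
  ./gradlew releaseTarGz  # tarball lands under core/build/distributions/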

Thanks,
Evan

Re: Fetch Request Purgatory and Mirrormaker

Posted by Evan Huus <ev...@shopify.com>.
On Tue, Apr 14, 2015 at 8:31 PM, Jiangjie Qin <jq...@linkedin.com.invalid>
wrote:

> Hey Evan,
>
> Is this issue only observed when mirror maker is consuming? It looks like
> you have some other consumers for Cluster A.
> Do you mean that if you stop mirror maker, the problem goes away?
>

Yes, exactly. The setup is A -> Mirrormaker -> B, so mirrormaker is
consuming from A and producing to B.

Cluster A is always fine. Cluster B is fine when mirrormaker is stopped.
Cluster B has the weird purgatory issue when mirrormaker is running.

Today I rolled out a change to reduce the
`fetch.purgatory.purge.interval.requests` and
`producer.purgatory.purge.interval.requests` configuration values on
cluster B from 1000 to 200, but it had no effect, which I find really weird.
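
As an aside, the purgatory size can be polled directly with the stock
JmxTool while experimenting with these values; the MBean name below is from
memory for 0.8.2 and the host/port are placeholders, so treat it as a
sketch:

  # Poll the fetch purgatory size on one broker (MBean name from memory for
  # 0.8.2 -- confirm with jconsole; port depends on the broker's JMX_PORT).
  bin/kafka-run-class.sh kafka.tools.JmxTool \
      --object-name 'kafka.server:type=FetchRequestPurgatory,name=PurgatorySize' \
      --jmx-url service:jmx:rmi:///jndi/rmi://broker-b1.example.com:9999/jmxrmi \
      --reporting-interval 5000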

Thanks,
Evan


Re: Fetch Request Purgatory and Mirrormaker

Posted by Jiangjie Qin <jq...@linkedin.com.INVALID>.
Hey Evan,

Is this issue only observed when mirror maker is consuming? It looks like
you have some other consumers for Cluster A.
Do you mean that if you stop mirror maker, the problem goes away?

Jiangjie (Becket) Qin


Re: Fetch Request Purgatory and Mirrormaker

Posted by Evan Huus <ev...@shopify.com>.
Any ideas on this? It's still occurring...

Is there a separate mailing list or project for mirrormaker where I could
ask?

Thanks,
Evan
