Posted to users@kafka.apache.org by Javier Holguera <ja...@fundingcircle.com> on 2016/09/01 16:24:28 UTC

Producer: metadata refresh when leader down

Hi,

Until recently, I thought that the process for producers and metadata went
like this:
1. Producer published to a leader broker that was down, which would fail.
2. Producer tried a few times (or zero, depending on config).
3. Eventually Producer would “fail” publishing that message.
4. That would trigger Producer contacting the brokers to fetch a new
metadata block, so it could find the newer leader and continue.

However, what I’m observing is that the Producer blocks after exhausting
its retries and does nothing until the metadata refreshes
“automatically”. That refresh appears to be driven by the time configured
in this property:

metadata.max.age.ms: The period of time in milliseconds after which we force
a refresh of metadata even if we haven't seen any partition leadership
changes to proactively discover any new brokers or partitions.

So basically, if the Producer happens to block shortly before the metadata
is due to expire on its own, publishing recovers quickly. However, if the
Producer blocks a few seconds after the last automatic refresh, and given
that the default for the property is 5 minutes, the Producer stays blocked
for nearly the whole 5 minutes.
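
To put numbers on the scenario above, here is a minimal Python sketch of the arithmetic. It assumes, as described in this message and not verified against the client code, that a blocked producer only recovers at the next age-based refresh. The function name and the sample timings are made up for illustration.

```python
METADATA_MAX_AGE_MS = 5 * 60 * 1000  # default metadata.max.age.ms (5 minutes)


def wait_until_next_refresh_ms(ms_since_last_refresh, max_age_ms=METADATA_MAX_AGE_MS):
    """Time left until the age-based refresh fires, never negative."""
    return max(max_age_ms - ms_since_last_refresh, 0)


# Leader goes down 3 s after a refresh: blocked for almost the full 5 minutes.
print(wait_until_next_refresh_ms(3_000))    # 297000
# Leader goes down 295 s after a refresh: recovery within 5 s.
print(wait_until_next_refresh_ms(295_000))  # 5000
```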

Is there something I’m missing or not understanding correctly?

Thanks for your help!

Regards,
Javier.

-- 
Javier Holguera

Re: Producer: metadata refresh when leader down

Posted by Javier Holguera <ja...@fundingcircle.com>.

Hi Yuto,

Thanks for the link. We aren’t setting `retry.backoff.ms`, so it should be
running with its default of 100 ms.

Correct me if I’m wrong, but in that case we would expect a delay of about
100 ms before the metadata is pulled. What we are observing are much longer
waits, sometimes lasting minutes.

That’s why I think `metadata.max.age.ms` is the key here: the shorter we
set it, the quicker producers recover from a leader crash.
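
A rough sketch of that trade-off in Python, assuming recovery only happens at the age-based refresh and that the leader can fail at any point in the refresh interval with equal likelihood. Both are assumptions for illustration, not confirmed client behaviour, and `recovery_profile` is a made-up helper name.

```python
def recovery_profile(max_age_ms):
    """Return (worst_case_wait_ms, average_wait_ms, refreshes_per_hour)."""
    worst = max_age_ms
    average = max_age_ms / 2  # failure equally likely anywhere in the interval
    refreshes_per_hour = 3_600_000 / max_age_ms
    return worst, average, refreshes_per_hour


for max_age_ms in (300_000, 60_000, 10_000):
    worst, average, per_hour = recovery_profile(max_age_ms)
    print(f"metadata.max.age.ms={max_age_ms}: worst {worst / 1000:.0f}s, "
          f"average {average / 1000:.0f}s, {per_hour:.0f} refreshes/hour")
```

A shorter setting cuts the wait but makes every producer request metadata more often, which is the cost to weigh.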

Regards,
Javier.

-- 
Javier Holguera


Re: Producer: metadata refresh when leader down

Posted by Yuto KAWAMURA <ka...@gmail.com>.

Hi Javier,

Not sure, but I wonder whether this could be related to your case:
https://issues.apache.org/jira/browse/KAFKA-4024?focusedCommentId=15458639&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15458639
Are you sure that metadata expiration is really what triggers the
"automatic refresh" you observed?
If you're setting `retry.backoff.ms` to a relatively large value, you
might want to take a look at the above issue and compare the timestamps
of log lines like "DEBUG Updated cluster metadata version N to...". If the
interval between two metadata updates corresponds to the value of
`retry.backoff.ms`, then the above issue is probably the cause.
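
The timestamp comparison could be scripted roughly as follows. This is only a sketch: the sample log lines, their timestamp layout, and the configured values in `SETTINGS` are assumptions, so adjust the pattern to whatever your logging layout actually prints.

```python
import re
from datetime import datetime

# Hypothetical DEBUG lines; the timestamp layout is an assumed logging pattern.
SAMPLE_LOG = """\
2016-09-02 15:50:01,000 DEBUG Updated cluster metadata version 2 to ...
2016-09-02 15:50:02,000 DEBUG Updated cluster metadata version 3 to ...
2016-09-02 15:55:02,000 DEBUG Updated cluster metadata version 4 to ...
"""

# Hypothetical configured values in milliseconds, for illustration only.
SETTINGS = {"retry.backoff.ms": 1_000, "metadata.max.age.ms": 300_000}

LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*Updated cluster metadata version \d+"
)


def update_intervals_ms(log_text):
    """Milliseconds between consecutive metadata updates found in the log."""
    stamps = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return [round((b - a).total_seconds() * 1000) for a, b in zip(stamps, stamps[1:])]


def closest_setting(interval_ms, settings):
    """Name of the setting whose value is nearest to the observed interval."""
    return min(settings, key=lambda name: abs(settings[name] - interval_ms))


for interval in update_intervals_ms(SAMPLE_LOG):
    print(interval, "ms ->", closest_setting(interval, SETTINGS))
```

Intervals that cluster around `retry.backoff.ms` would point at the issue above; intervals around `metadata.max.age.ms` would point at plain age-based expiry.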

Cheers,
Yuto
