Posted to users@kafka.apache.org by Kris K <sq...@gmail.com> on 2016/02/23 06:37:21 UTC

new producer failed with org.apache.kafka.common.errors.TimeoutException

Hi All,

I saw an issue today wherein the producers (new producers) started to fail
with org.apache.kafka.common.errors.TimeoutException: Failed to update
metadata after 60000 ms.

This issue happened when we took down one of the 6 brokers (running version
0.8.2.1) for planned maintenance (graceful shutdown).

This broker happens to be the last one in the list of 3 brokers that are
part of bootstrap.servers.

As per my understanding, the producers should have used the other two
brokers in the bootstrap.servers list for metadata calls. But this did not
happen.

Is there any producer property that could have caused this? Any way to
figure out which broker is being used by producers for metadata calls?

Thanks,
Kris
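
For readers following along, here is a minimal sketch of the new-producer settings that bear on the question above. The broker hostnames are hypothetical. The timeout and backoff values are what I understand to be the 0.8.2 client defaults, so please verify them against the documentation for your client version. The sketch uses only java.util.Properties, so it runs without the Kafka client on the classpath; the same Properties object is the kind of thing you would pass to the KafkaProducer constructor.

```java
import java.util.Properties;

final class ProducerSettingsSketch {

    // Builds producer settings relevant to metadata fetches.
    // Hostnames are hypothetical; numeric values are assumed 0.8.2 defaults.
    static Properties producerProps() {
        Properties props = new Properties();

        // Three of the six brokers, used only for the initial metadata lookup.
        props.put("bootstrap.servers",
                "broker1.example.com:9092,broker2.example.com:9092,broker3.example.com:9092");

        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // How long send() may block waiting for metadata; this matches the
        // "Failed to update metadata after 60000 ms" in the exception.
        props.put("metadata.fetch.timeout.ms", "60000");

        // Forced metadata refresh interval, even without errors.
        props.put("metadata.max.age.ms", "300000");

        // Wait before reconnecting to a broker that just failed.
        props.put("reconnect.backoff.ms", "10");

        // Wait before retrying a failed request.
        props.put("retry.backoff.ms", "100");

        return props;
    }

    public static void main(String[] args) {
        Properties props = producerProps();
        for (String name : props.stringPropertyNames()) {
            System.out.println(name + "=" + props.getProperty(name));
        }
    }
}
```

None of these settings pins the producer to a particular bootstrap broker; they only control how long it waits and how quickly it retries.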

Re: new producer failed with org.apache.kafka.common.errors.TimeoutException

Posted by Kris K <sq...@gmail.com>.

Thanks Ewen. I will get back with findings from my test run using 0.9.0.1
client.

Regards,
Kris

On Tue, Feb 23, 2016 at 9:26 PM, Ewen Cheslack-Postava <ew...@confluent.io>
wrote:

> Kris,
>
> This is a bit surprising, but handling the bootstrap servers, broker
> failures/retirement, and cluster metadata properly is surprisingly hard to
> get right!
>
> https://issues.apache.org/jira/browse/KAFKA-1843 explains some of the
> challenges. https://issues.apache.org/jira/browse/KAFKA-3068 shows the
> types of issues that can result from trying to better recover from failures
> or your situation of graceful shutdown.
>
> I think https://issues.apache.org/jira/browse/KAFKA-2459 might have
> addressed the incorrect behavior you are seeing in 0.8.2.1 -- the same
> bootstrap broker could be selected due to incorrect handling of
> backoff/timeouts. I can't be sure without more info, but it sounds like it
> could be the same issue. Despite part of the fix being rolled back due to
> KAFKA-3068, I think the relevant part which fixes the timeouts should still
> be present in 0.9.0.1. If you can easily reproduce, could you test if the
> newer release fixes the issue for you?
>
> -Ewen
>
> On Mon, Feb 22, 2016 at 9:37 PM, Kris K <sq...@gmail.com> wrote:
>
> > Hi All,
> >
> > I saw an issue today wherein the producers (new producers) started to fail
> > with org.apache.kafka.common.errors.TimeoutException: Failed to update
> > metadata after 60000 ms.
> >
> > This issue happened when we took down one of the 6 brokers (running version
> > 0.8.2.1) for planned maintenance (graceful shutdown).
> >
> > This broker happens to be the last one in the list of 3 brokers that are
> > part of bootstrap.servers.
> >
> > As per my understanding, the producers should have used the other two
> > brokers in the bootstrap.servers list for metadata calls. But this did not
> > happen.
> >
> > Is there any producer property that could have caused this? Any way to
> > figure out which broker is being used by producers for metadata calls?
> >
> > Thanks,
> > Kris
> >
>
>
>
> --
> Thanks,
> Ewen
>

Re: new producer failed with org.apache.kafka.common.errors.TimeoutException

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.

Kris,

This is a bit surprising, but handling the bootstrap servers, broker
failures/retirement, and cluster metadata properly is surprisingly hard to
get right!

https://issues.apache.org/jira/browse/KAFKA-1843 explains some of the
challenges. https://issues.apache.org/jira/browse/KAFKA-3068 shows the
types of issues that can result from trying to better recover from failures
or your situation of graceful shutdown.

I think https://issues.apache.org/jira/browse/KAFKA-2459 might have
addressed the incorrect behavior you are seeing in 0.8.2.1 -- the same
bootstrap broker could be selected due to incorrect handling of
backoff/timeouts. I can't be sure without more info, but it sounds like it
could be the same issue. Despite part of the fix being rolled back due to
KAFKA-3068, I think the relevant part which fixes the timeouts should still
be present in 0.9.0.1. If you can easily reproduce, could you test if the
newer release fixes the issue for you?

-Ewen

On Mon, Feb 22, 2016 at 9:37 PM, Kris K <sq...@gmail.com> wrote:

> Hi All,
>
> I saw an issue today wherein the producers (new producers) started to fail
> with org.apache.kafka.common.errors.TimeoutException: Failed to update
> metadata after 60000 ms.
>
> This issue happened when we took down one of the 6 brokers (running version
> 0.8.2.1) for planned maintenance (graceful shutdown).
>
> This broker happens to be the last one in the list of 3 brokers that are
> part of bootstrap.servers.
>
> As per my understanding, the producers should have used the other two
> brokers in the bootstrap.servers list for metadata calls. But this did not
> happen.
>
> Is there any producer property that could have caused this? Any way to
> figure out which broker is being used by producers for metadata calls?
>
> Thanks,
> Kris
>

-- 
Thanks,
Ewen
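
To make the backoff point above concrete, here is a toy illustration of node selection that skips brokers still inside their reconnect backoff window. This is not Kafka's NetworkClient code and not the actual KAFKA-2459 patch. The class name, method names, and broker addresses are all hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: round-robin selection that avoids recently failed nodes.
final class BackoffAwareSelection {
    private final List<String> nodes;
    private final long reconnectBackoffMs;
    private final Map<String, Long> lastFailureMs = new HashMap<>();
    private int next = 0;

    BackoffAwareSelection(List<String> nodes, long reconnectBackoffMs) {
        this.nodes = new ArrayList<>(nodes);
        this.reconnectBackoffMs = reconnectBackoffMs;
    }

    // Remember when a connection attempt to this node failed.
    void recordFailure(String node, long nowMs) {
        lastFailureMs.put(node, nowMs);
    }

    // A node is backing off if it failed less than reconnectBackoffMs ago.
    boolean inBackoff(String node, long nowMs) {
        Long failedAt = lastFailureMs.get(node);
        return failedAt != null && nowMs - failedAt < reconnectBackoffMs;
    }

    // Returns the next node that is not backing off, or null if all are.
    String choose(long nowMs) {
        for (int i = 0; i < nodes.size(); i++) {
            int index = (next + i) % nodes.size();
            String candidate = nodes.get(index);
            if (!inBackoff(candidate, nowMs)) {
                next = (index + 1) % nodes.size();
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        BackoffAwareSelection selection = new BackoffAwareSelection(
                Arrays.asList("b1:9092", "b2:9092", "b3:9092"), 1000L);
        selection.recordFailure("b3:9092", 0L); // b3 was shut down
        for (long t = 0; t < 500; t += 100) {
            System.out.println(t + " ms -> " + selection.choose(t));
        }
    }
}
```

If the backoff check is wrong (for example, the failure time is never recorded or is compared against the wrong clock), a selector like this can keep returning the same dead broker, which is the kind of symptom described in the original report.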
