Posted to users@kafka.apache.org by Edward Capriolo <ed...@gmail.com> on 2022/01/23 12:51:30 UTC

Discuss: outage impact of not updating kafka clients to a version that has KAFKA-9893

Hello,
The bulk of this thread is to discuss KAFKA-9893, "Configurable TCP
connection timeout and improve the initial metadata fetch"
<https://issues.apache.org/jira/browse/KAFKA-9893>. IMHO it should be
considered a BUG FIX and potentially backported, but others may
disagree. Let me tell you about my environment:

We run Kafka 2.2.x, 2.5.x, and even 2.6.x. We have clients using Spark
Streaming, Spring Boot with spring-kafka, and folks just using the Kafka
producer directly. Clusters are 3-12 nodes, roughly 10 topics with 48
partitions each.

We actually have a chaos monkey in our UAT environment that shuts down
brokers, and even entire datacenters/racks of brokers, across our
clusters each day, but simply shutting down the broker process does not
produce the problem.

We observed that there is a huge distinction between the Kafka broker
being down while the host is still up, and Kafka being down because the
*host itself is down.*

Before KAFKA-9893 the second case was handled poorly. If you look at how
the metadata connection works, the client randomly picks hosts from the
bootstrap list, and sometimes needs two randomly chosen hosts for
round-trip operations.
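
To make the distinction concrete, here is a tiny stand-alone probe
(nothing Kafka specific, plain java.net; the hostname is made up). When
the host is up but the broker is down, the connect fails almost
instantly with "Connection refused"; when the host itself is down,
nothing answers the SYN and the call blocks for the full timeout:

  import java.net.InetSocketAddress;
  import java.net.Socket;

  public class ConnectProbe {
      public static void main(String[] args) throws Exception {
          try (Socket s = new Socket()) {
              long start = System.currentTimeMillis();
              try {
                  // host up + port closed: fails in ~1 ms (RST comes back)
                  // host down: blocks for the full 10s before failing
                  s.connect(new InetSocketAddress("broker1.example.com", 9092), 10_000);
              } catch (Exception e) {
                  long elapsed = System.currentTimeMillis() - start;
                  System.out.println(e + " after " + elapsed + " ms");
              }
          }
      }
  }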

Here is what we did. We had all our apps running: Spark Streaming,
Spring Boot, etc. Then we shut off a server physically (pick host 1 in
the metadata broker list and power down the physical hardware). Spark
Streaming simply could not move forward, hitting frequent timeouts and
failing tasks. (This may be due to our number of topics and partitions,
7 topics with 48 partitions each; not sure.)

The fix is pretty simple. Even the latest Spark Streaming still declares
only kafka-clients 2.6.0. We simply updated the kafka-clients artifact
(in Maven) to 2.7.2 and set the new connection setup timeout to
something like 3 seconds, and the process keeps running while the node
is down. The good news is that kafka-clients generally seems backwards
compatible; even things compiled against kafka-clients 2.0 do not seem
to have a problem with a newer kafka-clients swapped in at runtime.
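
For reference, a minimal sketch of the client-side change, assuming a
plain producer (broker names are placeholders; the property comes from
KIP-601 / KAFKA-9893, so it only takes effect once
org.apache.kafka:kafka-clients 2.7+ is on the classpath):

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerConfig;
  import org.apache.kafka.common.serialization.StringSerializer;

  public class TimeoutExample {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                  "broker1:9092,broker2:9092,broker3:9092");
          props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  StringSerializer.class.getName());
          props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  StringSerializer.class.getName());
          // Give up on a broker that does not complete the TCP connect
          // within ~3s (default is 10s), so the client moves on to another
          // bootstrap broker instead of stalling the metadata fetch.
          props.put("socket.connection.setup.timeout.ms", "3000");
          try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
              // produce as usual
          }
      }
  }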

The other mitigation we are doing is introducing a GSLB-based
round-robin load balancer in front of the brokers (using DNS). We assume
this will work well, but honestly it somewhat defeats the purpose of
metadata.broker.list.
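
If anyone else goes the DNS route, this is roughly the client side we
have in mind, continuing the producer sketch above (the hostname is
invented; client.dns.lookup has been available in the clients since
2.1):

  // Single GSLB / round-robin DNS name instead of a literal broker list.
  props.put("bootstrap.servers", "kafka-bootstrap.example.com:9092");
  // Resolve and try every A record behind that name, not just the first IP.
  props.put("client.dns.lookup", "use_all_dns_ips");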

Recap: IMHO, based on what I have seen, I would advise everyone to
update their clients to 2.7.2 and set the timeout defined in the JIRA
(but your mileage may vary). And since you are probably patching log4j
now anyway, you might as well update the Kafka dependencies at the same
time.

Please discuss if others have seen this issue, or if this is only something
that affects me.

Re: Discuss: outage impact of not updating kafka clients to a version that has KAFKA-9893

Posted by Edward Capriolo <ed...@gmail.com>.
On Sun, Jan 23, 2022 at 7:51 AM Edward Capriolo <ed...@gmail.com>
wrote:

> [snip]

Also, to be clear: upgrading to 2.7.2 alone does not fix the issue. We
had to set the timeout property to something lower than the default (we
used about 3 seconds). The issue happens because if the OS is up but
Kafka is down, the TCP reply from the closed port comes back quickly
(connection refused), but if the host is down there is no reply at all,
so the client's connection timeout settings become a big factor.
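
For what it is worth, in our Spark jobs the property just goes into the
Kafka parameter map like any other consumer setting. A rough sketch,
assuming the DStream API where the map is handed to
KafkaUtils.createDirectStream (group id and broker names are
placeholders):

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.kafka.common.serialization.StringDeserializer;

  public class SparkKafkaParams {
      static Map<String, Object> kafkaParams() {
          Map<String, Object> p = new HashMap<>();
          p.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
          p.put("key.deserializer", StringDeserializer.class);
          p.put("value.deserializer", StringDeserializer.class);
          p.put("group.id", "example-group");
          // Only honored once kafka-clients 2.7+ is actually on the job's
          // classpath; on 2.6.0 the key is simply reported as unknown.
          p.put("socket.connection.setup.timeout.ms", "3000");
          return p;
      }
  }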