Posted to dev@kafka.apache.org by "Jun Yao (JIRA)" <ji...@apache.org> on 2017/02/05 19:03:42 UTC

[jira] [Created] (KAFKA-4736) producer fails too slowly when metadata request fails

Jun Yao created KAFKA-4736:
------------------------------

             Summary: producer fails too slowly when metadata request fails
                 Key: KAFKA-4736
                 URL: https://issues.apache.org/jira/browse/KAFKA-4736
             Project: Kafka
          Issue Type: Bug
          Components: producer 
            Reporter: Jun Yao


This might be similar to https://issues.apache.org/jira/browse/KAFKA-4385, but it happens in a different case.

In some cases, as tested, the producer may get invalid metadata (and the metadata stays invalid as long as there is an issue on the broker side). Whenever KafkaProducer.send() is called, it then spends 60 seconds (with the default configuration) in KafkaProducer.waitOnMetadata() and throws TimeoutException("Failed to update metadata after 60000 ms").

So when something is wrong with a topic and the producer cannot get that topic's metadata, every send is effectively 'blocked' by this topic for 60 seconds, which also impacts sending to other topics.
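
A minimal sketch of what the caller observes (the broker address and topic name are made up here; depending on the client version, the TimeoutException either propagates from send() itself or arrives through the returned future):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SlowFailureDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            long start = System.currentTimeMillis();
            try {
                // "broken-topic" stands for any topic whose metadata cannot be fetched
                producer.send(new ProducerRecord<>("broken-topic", "key", "value")).get();
            } catch (Exception e) {
                // by the time this fires, the calling thread has already been
                // parked in waitOnMetadata() for max.block.ms (60000 ms by default)
                System.out.println("failed after " + (System.currentTimeMillis() - start) + " ms: " + e);
            } finally {
                producer.close();
            }
        }
    }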

For cases where we want to use the "Callback" to save the failed records somewhere else and retry them later, the Callback is likewise only invoked every 60 seconds. So if the upstream keeps receiving data and calling producer.send(), it will soon either block, or buffer too much in memory if the upstream buffers data before calling producer.send().
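
A minimal sketch of that pattern, assuming an in-memory retry queue (the queue and the retry mechanism are illustrative, not part of the Kafka API):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    // assumes "producer" is a configured KafkaProducer<String, String>
    final Queue<ProducerRecord<String, String>> retryQueue = new ConcurrentLinkedQueue<>();

    final ProducerRecord<String, String> record = new ProducerRecord<>("some-topic", "key", "value");
    producer.send(record, new Callback() {
        @Override
        public void onCompletion(RecordMetadata metadata, Exception exception) {
            if (exception != null) {
                // with invalid metadata this only fires after the full
                // max.block.ms wait, i.e. one failure surfaces per 60 s per send
                retryQueue.add(record);
            }
        }
    });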

It looks to me like KafkaProducer.send() fails too slowly (it does not fail fast) when something is wrong with a topic/broker, and it stays in this slow-failure state indefinitely.

I am not sure reducing "max.block.ms" is the right way to avoid this, since when metadata changes the producer needs some time to fetch the updated metadata, and with auto topic creation it also needs enough time to wait for the topic to be created.
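
For reference, the knob in question, with an arbitrary example value (5000 ms is made up and, as noted above, may be too short for legitimate metadata refreshes or auto-created topics):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;

    Properties props = new Properties();
    // ... bootstrap.servers, serializers, etc.
    // "max.block.ms" bounds how long send()/partitionsFor() may block,
    // including the wait for metadata; the default is 60000 ms
    props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "5000");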

Following [~sslavic]'s comment, I am wondering if a better way of defining and utilizing RetriableException would help here. It may need some support from the server side, so that when an exception is not retriable the client does not waste time retrying it.

Or maybe consider my proposal on https://issues.apache.org/jira/browse/KAFKA-4385 to add another config that limits the number of consecutive failures on one topic.

Or maybe some adaptive behavior, where the block time is decreased after a number of consecutive failures; a sketch of the idea follows.
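
To make the last two ideas concrete, here is a hypothetical client-side sketch (none of these names exist in the Kafka API; the threshold and cooldown values are assumptions) that fails sends to a topic fast after repeated metadata failures:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class FailFastGate {
        private final int maxConsecutiveFailures = 3;   // assumed threshold
        private final long cooldownMs = 10_000L;        // assumed cooldown
        private final Map<String, Integer> failures = new ConcurrentHashMap<>();
        private final Map<String, Long> trippedAt = new ConcurrentHashMap<>();

        // call before producer.send(); returning false means "fail fast now"
        // instead of blocking in waitOnMetadata() for another 60 seconds
        boolean shouldAttempt(String topic) {
            Long tripped = trippedAt.get(topic);
            if (tripped == null) return true;
            if (System.currentTimeMillis() - tripped > cooldownMs) {
                trippedAt.remove(topic);                // let one probe through
                return true;
            }
            return false;
        }

        void recordFailure(String topic) {
            if (failures.merge(topic, 1, Integer::sum) >= maxConsecutiveFailures)
                trippedAt.putIfAbsent(topic, System.currentTimeMillis());
        }

        void recordSuccess(String topic) {
            failures.remove(topic);
            trippedAt.remove(topic);
        }
    }

An adaptive variant could shrink the effective block timeout as the failure count grows, rather than using a hard cutoff like this.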

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)