You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "Wenhao Ji (Jira)" <ji...@apache.org> on 2022/01/26 08:39:00 UTC
[jira] [Commented] (KAFKA-7572) Producer should not send requests with negative partition id
[ https://issues.apache.org/jira/browse/KAFKA-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482323#comment-17482323 ]
Wenhao Ji commented on KAFKA-7572:
----------------------------------
[~guozhang] Would you mind reviewing the pr [#10525|https://github.com/apache/kafka/pull/10525] for me?
> Producer should not send requests with negative partition id
> ------------------------------------------------------------
>
> Key: KAFKA-7572
> URL: https://issues.apache.org/jira/browse/KAFKA-7572
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Affects Versions: 1.0.1, 1.1.1
> Reporter: Yaodong Yang
> Assignee: Wenhao Ji
> Priority: Major
> Labels: patch-available
>
> h3. Issue:
> In one Kafka producer log from our users, we found the following weird one:
> timestamp="2018-10-09T17:37:41,237-0700",level="ERROR", Message="Write to Kafka failed with: ",exception="java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for topicName--2: 30042 ms has passed since batch creation plus linger time
> at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94)
> at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:64)
> at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for topicName--2: 30042 ms has passed since batch creation plus linger time"
> After a few hours debugging, we finally understood the root cause of this issue:
> # The producer used a buggy custom Partitioner, which sometimes generates negative partition ids for new records.
> # The corresponding produce requests were rejected by brokers, because it's illegal to have a partition with a negative id.
> # The client kept refreshing its local cluster metadata, but could not send produce requests successfully.
> # From the above log, we found a suspicious string "topicName--2":
> # According to the source code, the format of this string in the log is TopicName+"-"+PartitionId.
> # It's not easy to notice that there were 2 consecutive dash in the above log.
> # Eventually, we found that the second dash was a negative sign. Therefore, the partition id is -2, rather than 2.
> # The bug the custom Partitioner.
> h3. Proposal:
> # Producer code should check the partitionId before sending requests to brokers.
> # If there is a negative partition Id, just throw an IllegalStateException{{ }}exception.
> # Such a quick check can save lots of time for people debugging their producer code.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)