Posted to jira@kafka.apache.org by "Philip Nee (Jira)" <ji...@apache.org> on 2022/09/05 03:34:00 UTC

[jira] [Updated] (KAFKA-14196) Flaky OffsetValidationTest seems to indicate potential duplication issue during rebalance

     [ https://issues.apache.org/jira/browse/KAFKA-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Nee updated KAFKA-14196:
-------------------------------
    Description: 
Several flaky tests under OffsetValidationTest indicate a potential consumer duplication issue when autocommit is enabled.  The failure message looks like this:

 
{code:java}
Total consumed records 3366 did not match consumed position 3331 {code}
 

After investigating the logs, I discovered that, in the failing tests, the data consumed between the start of a rebalance event and the async commit was not covered by that commit and was therefore consumed again after the reassignment.  In the example below, the rebalance kicks in at around 1662054846995 (first record), and the async commit of offset 3739 completes at around 1662054847015 (right before partitions_revoked).

 
{code:java}
{"timestamp":1662054846995,"name":"records_consumed","count":3,"partitions":[{"topic":"test_topic","partition":0,"count":3,"minOffset":3739,"maxOffset":3741}]}
{"timestamp":1662054846998,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3742,"maxOffset":3743}]}
{"timestamp":1662054847008,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3744,"maxOffset":3745}]}
{"timestamp":1662054847016,"name":"partitions_revoked","partitions":[{"topic":"test_topic","partition":0}]}
{"timestamp":1662054847031,"name":"partitions_assigned","partitions":[{"topic":"test_topic","partition":0}]}
{"timestamp":1662054847038,"name":"records_consumed","count":23,"partitions":[{"topic":"test_topic","partition":0,"count":23,"minOffset":3739,"maxOffset":3761}]} {code}
A few things to note here:
 # Manually calling commitSync in the onPartitionsRevoked callback seems to alleviate the issue (see the sketch after this list)
 # Setting includeMetadataInTimeout to false also seems to alleviate the issue.
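
For reference, note #1 corresponds to something like the sketch below.  This is a minimal, hypothetical example (broker address, group id, deserializers, and topic name are placeholders, not taken from the system test); it only illustrates where the commitSync call goes.
{code:java}
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RevokeCommitSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("group.id", "test-group");                 // placeholder
        props.put("enable.auto.commit", "true");             // autocommit on, as in the failing tests
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("test_topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Synchronously flush the offsets of everything consumed so far before the
                // partitions are handed off; this closes the window in which the async
                // autocommit can be lost during the rebalance.
                consumer.commitSync();
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // nothing to do
            }
        });

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            records.forEach(record -> {
                // process the record
            });
        }
    }
} {code}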

The observations above seem to suggest that the contract between poll() and the asynchronous autocommit is broken.  AFAIK, poll() implicitly acts as the ack for the previously fetched data, and the consumer should (try to) commit those offsets on a subsequent iteration of the poll() loop.  However, it seems that as poll() keeps looping, the "acked" data isn't being committed.
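
To make the suspected contract concrete, here is a minimal sketch of the poll()/async-commit interaction as I understand it.  It spells the behavior out with an explicit commitAsync() instead of relying on autocommit, and it is not the actual consumer internals; the consumer is assumed to be set up as in the sketch above.
{code:java}
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

class PollAckSketch {
    // Offsets returned by one poll() are treated as "acked" and are expected to be
    // committed (asynchronously) on a later iteration of the poll loop.
    static void pollLoop(KafkaConsumer<String, String> consumer) {
        Map<TopicPartition, OffsetAndMetadata> acked = new HashMap<>();
        while (true) {
            if (!acked.isEmpty()) {
                // This is what autocommit is expected to do for us: commit the offsets
                // of everything the previous poll() handed back.
                consumer.commitAsync(new HashMap<>(acked), null);
                acked.clear();
            }
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                // Process the record, then remember the next offset to commit.
                acked.put(new TopicPartition(record.topic(), record.partition()),
                          new OffsetAndMetadata(record.offset() + 1));
            }
        }
    }
} {code}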

 

After the investigation, I believe, 

 

  was:
Several flaky tests under OffsetValidationTest indicate a potential consumer duplication issue when autocommit is enabled.  The failure message looks like this:

 
{code:java}
Total consumed records 3366 did not match consumed position 3331 {code}
 

After investigating the logs, I discovered that, in the failing tests, the data consumed between the start of a rebalance event and the async commit was not covered by that commit and was therefore consumed again after the reassignment.  In the example below, the rebalance kicks in at around 1662054846995 (first record), and the async commit of offset 3739 completes at around 1662054847015 (right before partitions_revoked).

 
{code:java}
{"timestamp":1662054846995,"name":"records_consumed","count":3,"partitions":[{"topic":"test_topic","partition":0,"count":3,"minOffset":3739,"maxOffset":3741}]}
{"timestamp":1662054846998,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3742,"maxOffset":3743}]}
{"timestamp":1662054847008,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3744,"maxOffset":3745}]}
{"timestamp":1662054847016,"name":"partitions_revoked","partitions":[{"topic":"test_topic","partition":0}]}
{"timestamp":1662054847031,"name":"partitions_assigned","partitions":[{"topic":"test_topic","partition":0}]}
{"timestamp":1662054847038,"name":"records_consumed","count":23,"partitions":[{"topic":"test_topic","partition":0,"count":23,"minOffset":3739,"maxOffset":3761}]} {code}
A few things to note here:
 # This is highly flaky; roughly 1 in 4 runs fail the test
 # Manually calling commitSync in the onPartitionsRevoked callback seems to alleviate the issue
 # Setting includeMetadataInTimeout to false also seems to alleviate the issue.

 


> Flaky OffsetValidationTest seems to indicate potential duplication issue during rebalance
> -----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-14196
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14196
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>    Affects Versions: 3.2.1
>            Reporter: Philip Nee
>            Assignee: Philip Nee
>            Priority: Major
>
> Several flaky tests under OffsetValidationTest indicate a potential consumer duplication issue when autocommit is enabled.  The failure message looks like this:
>  
> {code:java}
> Total consumed records 3366 did not match consumed position 3331 {code}
>  
> After investigating the logs, I discovered that, in the failing tests, the data consumed between the start of a rebalance event and the async commit was not covered by that commit and was therefore consumed again after the reassignment.  In the example below, the rebalance kicks in at around 1662054846995 (first record), and the async commit of offset 3739 completes at around 1662054847015 (right before partitions_revoked).
>  
> {code:java}
> {"timestamp":1662054846995,"name":"records_consumed","count":3,"partitions":[{"topic":"test_topic","partition":0,"count":3,"minOffset":3739,"maxOffset":3741}]}
> {"timestamp":1662054846998,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3742,"maxOffset":3743}]}
> {"timestamp":1662054847008,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3744,"maxOffset":3745}]}
> {"timestamp":1662054847016,"name":"partitions_revoked","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847031,"name":"partitions_assigned","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847038,"name":"records_consumed","count":23,"partitions":[{"topic":"test_topic","partition":0,"count":23,"minOffset":3739,"maxOffset":3761}]} {code}
> A few things to note here:
>  # Manually calling commitSync in the onPartitionsRevoked callback seems to alleviate the issue
>  # Setting includeMetadataInTimeout to false also seems to alleviate the issue.
> The observations above seem to suggest that the contract between poll() and the asynchronous autocommit is broken.  AFAIK, poll() implicitly acts as the ack for the previously fetched data, and the consumer should (try to) commit those offsets on a subsequent iteration of the poll() loop.  However, it seems that as poll() keeps looping, the "acked" data isn't being committed.
>  
> After the investigation, I believe, 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)