Posted to jira@kafka.apache.org by "Dimitar Dimitrov (Jira)" <ji...@apache.org> on 2023/06/02 22:07:00 UTC

[jira] [Updated] (KAFKA-15052) Fix flaky test QuorumControllerTest.testBalancePartitionLeaders()

     [ https://issues.apache.org/jira/browse/KAFKA-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dimitar Dimitrov updated KAFKA-15052:
-------------------------------------
    Description: 
The test failed at [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/1892/tests/] as well as in various local runs.

The test creates a topic, fences a broker, and observes a partition leader imbalance as another broker takes over the partition the fenced broker lost. It then re-registers and unfences the broker, sends {{AlterPartition}} for the lost partition to add the now-unfenced broker back to its ISR, and waits for the imbalance to disappear.
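
For illustration, here is a minimal, self-contained sketch of the scenario the test exercises. All names here are made up for the example and do not come from the Kafka test harness:
{code:java}
// Hypothetical model of the test's fence -> imbalance -> unfence -> rebalance flow.
import java.util.*;

public class LeaderImbalanceModel {
    // partition -> replica list; the preferred leader is the first-listed replica
    static final Map<String, List<Integer>> REPLICAS = Map.of(
        "foo-0", List.of(1, 2, 3),
        "foo-1", List.of(2, 3, 1),
        "foo-2", List.of(3, 1, 2));
    static final Map<String, Integer> leader = new HashMap<>();
    static final Set<Integer> fenced = new HashSet<>();

    static boolean isImbalanced() {
        // a partition is "imbalanced" when its leader is not the preferred replica
        return REPLICAS.keySet().stream()
            .anyMatch(p -> !leader.get(p).equals(REPLICAS.get(p).get(0)));
    }

    static void fence(int broker) {
        fenced.add(broker);
        // leadership of the fenced broker's partitions moves to the first
        // remaining unfenced replica (or -1 if none is left)
        leader.replaceAll((p, l) -> l != broker ? l :
            REPLICAS.get(p).stream().filter(b -> !fenced.contains(b))
                .findFirst().orElse(-1));
    }

    public static void main(String[] args) {
        REPLICAS.forEach((p, r) -> leader.put(p, r.get(0))); // create topic "foo"
        fence(1);                                            // fence one broker
        System.out.println("imbalanced: " + isImbalanced()); // true
        fenced.remove(1);                                    // re-register + unfence
        // AlterPartition restores the ISR; maybeBalancePartitionLeaders then
        // moves each leader back to its preferred replica
        leader.replaceAll((p, l) -> REPLICAS.get(p).get(0));
        System.out.println("imbalanced: " + isImbalanced()); // false
    }
}
{code}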

The local failures seem to happen when brokers (including ones the test never fences deliberately) are accidentally fenced after losing their session, by reaching the session timeout that the test sets aggressively low on purpose.
The Cloudbees failure linked above also appears to show this:
{code:java}
...[truncated 738209 chars]...
23. (org.apache.kafka.controller.QuorumController:768)
[2023-06-02 18:17:22,202] DEBUG [QuorumController id=0] Scheduling write event for maybeBalancePartitionLeaders because scheduled (DEFERRED), checkIntervalNs (OptionalLong[1000000000]) and isImbalanced (true) (org.apache.kafka.controller.QuorumController:1401)
[2023-06-02 18:17:22,202] INFO [QuorumController id=0] Fencing broker 2 because its session has timed out. (org.apache.kafka.controller.ReplicationControlManager:1459)
[2023-06-02 18:17:22,203] DEBUG [QuorumController id=0] handleBrokerFenced: changing partition(s): foo-0, foo-1, foo-2 (org.apache.kafka.controller.ReplicationControlManager:1750)
[2023-06-02 18:17:22,203] DEBUG [QuorumController id=0] partition change for foo-0 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 2 -> -1, leaderEpoch: 2 -> 3, partitionEpoch: 2 -> 3 (org.apache.kafka.controller.ReplicationControlManager:157)
[2023-06-02 18:17:22,204] DEBUG [QuorumController id=0] partition change for foo-1 with topic ID 033_QSX7TfitL4SDzoeR4w: isr: [2, 3] -> [3], leaderEpoch: 3 -> 4, partitionEpoch: 4 -> 5 (org.apache.kafka.controller.ReplicationControlManager:157)
[2023-06-02 18:17:22,204] DEBUG [QuorumController id=0] partition change for foo-2 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 2 -> -1, leaderEpoch: 2 -> 3, partitionEpoch: 2 -> 3 (org.apache.kafka.controller.ReplicationControlManager:157)
[2023-06-02 18:17:22,205] DEBUG append(batch=LocalRecordBatch(leaderEpoch=1, appendTimestamp=240, records=[ApiMessageAndVersion(PartitionChangeRecord(partitionId=0, topicId=033_QSX7TfitL4SDzoeR4w, isr=null, leader=-1, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at version 0), ApiMessageAndVersion(PartitionChangeRecord(partitionId=1, topicId=033_QSX7TfitL4SDzoeR4w, isr=[3], leader=3, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at version 0), ApiMessageAndVersion(PartitionChangeRecord(partitionId=2, topicId=033_QSX7TfitL4SDzoeR4w, isr=null, leader=-1, replicas=null, removingReplicas=null, addingReplicas=null, leaderRecoveryState=-1) at version 0), ApiMessageAndVersion(BrokerRegistrationChangeRecord(brokerId=2, brokerEpoch=3, fenced=1, inControlledShutdown=0) at version 0)]), prevOffset=27) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:253)
[2023-06-02 18:17:22,205] DEBUG [QuorumController id=0] Creating in-memory snapshot 27 (org.apache.kafka.timeline.SnapshotRegistry:197)
[2023-06-02 18:17:22,205] DEBUG [LocalLogManager 0] Node 0: running log check. (org.apache.kafka.metalog.LocalLogManager:512)
[2023-06-02 18:17:22,205] DEBUG [QuorumController id=0] Read-write operation maybeFenceReplicas(451616131) will be completed when the log reaches offset 27. (org.apache.kafka.controller.QuorumController:768)
[2023-06-02 18:17:22,208] INFO [QuorumController id=0] Fencing broker 3 because its session has timed out. (org.apache.kafka.controller.ReplicationControlManager:1459)
[2023-06-02 18:17:22,209] DEBUG [QuorumController id=0] handleBrokerFenced: changing partition(s): foo-1 (org.apache.kafka.controller.ReplicationControlManager:1750)
[2023-06-02 18:17:22,209] DEBUG [QuorumController id=0] partition change for foo-1 with topic ID 033_QSX7TfitL4SDzoeR4w: leader: 3 -> -1, leaderEpoch: 4 -> 5, partitionEpoch: 5 -> 6 (org.apache.kafka.controller.ReplicationControlManager:157){code}
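
The fencing decision behind the "Fencing broker N because its session has timed out" lines above boils down to a last-heartbeat-versus-timeout check. Here is a minimal sketch, with invented names rather than the actual ReplicationControlManager code, of why an aggressively low timeout can fence healthy brokers on a slow CI machine:
{code:java}
// Hypothetical sketch of a session-expiry check; names are illustrative only.
import java.util.*;

public class SessionExpiryDemo {
    static final long SESSION_TIMEOUT_NS = 1_000_000;   // aggressively low, as in the test
    static final Map<Integer, Long> lastHeartbeatNs = new HashMap<>();

    // brokers whose last heartbeat is older than the session timeout get fenced
    static List<Integer> brokersToFence(long nowNs) {
        List<Integer> expired = new ArrayList<>();
        lastHeartbeatNs.forEach((broker, last) -> {
            if (nowNs - last > SESSION_TIMEOUT_NS) expired.add(broker);
        });
        return expired;
    }

    public static void main(String[] args) {
        lastHeartbeatNs.put(2, 0L);
        lastHeartbeatNs.put(3, 0L);
        // If anything (a GC pause, a slow CI machine) delays the next heartbeat
        // past the tiny timeout, the broker is fenced even though it is healthy.
        long nowNs = 2_000_000;
        System.out.println("fence: " + brokersToFence(nowNs)); // fence: [2, 3]
    }
}
{code}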


> Fix flaky test QuorumControllerTest.testBalancePartitionLeaders()
> -----------------------------------------------------------------
>
>                 Key: KAFKA-15052
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15052
>             Project: Kafka
>          Issue Type: Test
>            Reporter: Dimitar Dimitrov
>            Assignee: Dimitar Dimitrov
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)