Posted to dev@kafka.apache.org by "John Fung (JIRA)" <ji...@apache.org> on 2012/09/14 21:39:09 UTC

[jira] [Created] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

John Fung created KAFKA-514:
-------------------------------

             Summary: Replication with Leader Failure Test: Log segment files checksum mismatch
                 Key: KAFKA-514
                 URL: https://issues.apache.org/jira/browse/KAFKA-514
             Project: Kafka
          Issue Type: Bug
            Reporter: John Fung


Test Description:

   1. Produce and consume messages to 1 topic with 3 partitions.
   2. The test sends 10 messages every 2 sec to 3 replicas.
   3. At the end, it verifies the log segment sizes and contents, and uses a consumer to verify that there is no message loss.
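The cross-replica verification in step 3 can be sketched as follows. This is a hypothetical helper, not the test framework's actual code: it computes a CRC32 checksum per segment file (matching the 32-bit checksums shown in the test result below) and reports partitions whose replicas disagree. Paths and the partition layout are illustrative.

```python
import zlib


def file_crc32(path):
    """CRC32 of a file's contents, as an unsigned 32-bit value."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def find_mismatches(replica_paths_by_partition):
    """replica_paths_by_partition: {partition: [segment file path on each broker]}.
    Returns the partitions whose segment checksums are not all equal."""
    bad = []
    for partition, paths in replica_paths_by_partition.items():
        checksums = {file_crc32(p) for p in paths}
        if len(checksums) > 1:
            bad.append(partition)
    return bad
```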

The issue:
When the leader is terminated by a controlled failure (kill -15), the resulting log segment file sizes do not all match across replicas. The mismatched log segment occurs in one of the partitions of the terminated broker. This is consistently reproducible from the system regression test for replication with the following configuration:

    * zookeeper: 1-node (local)
    * brokers: 3-node cluster (all local)
    * replication factor: 3
    * no. of topics: 1
    * no. of partitions: 2
    * iterations of leader failure: 1

Remarks:

    * It is rarely reproducible if the no. of partitions is 1.
    * Even though the file checksums do not match, the no. of messages in the producer & consumer logs are equal.


Test result (shown with log file checksum):

broker-1 :
test_1-0/00000000000000000000.kafka => 1690639555
test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas

broker-2 :
test_1-0/00000000000000000000.kafka => 1690639555
test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas

broker-3 :
test_1-0/00000000000000000000.kafka => 1690639555
test_1-1/00000000000000000000.kafka => 3530842923    <<<< not matching across all replicas

Errors:
The following error is found in the terminated leader:

[2012-09-14 11:07:05,217] WARN No previously checkpointed highwatermark value found for topic test_1 partition 1. Returning 0 as the highwatermark (kafka.server.HighwaterMarkCheckpoint)
[2012-09-14 11:07:05,220] ERROR Replica Manager on Broker 3: Error processing leaderAndISR request LeaderAndIsrRequest(1,,true,1000,Map((test_1,1) -> { "ISR": "1,2","leader": "1","leaderEpoch": "0" }, (test_1,0) -> { "ISR": "
1,2","leader": "1","leaderEpoch": "1" })) (kafka.server.ReplicaManager)
kafka.common.KafkaException: End index must be segment list size - 1
        at kafka.log.SegmentList.truncLast(SegmentList.scala:82)
        at kafka.log.Log.truncateTo(Log.scala:471)
        at kafka.cluster.Partition.makeFollower(Partition.scala:171)
        at kafka.cluster.Partition.makeLeaderOrFollower(Partition.scala:126)
        at kafka.server.ReplicaManager.kafka$server$ReplicaManager$$makeFollower(ReplicaManager.scala:195)
        at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:154)
        at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:144)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
        at scala.collection.Iterator$class.foreach(Iterator.scala:631)
        at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:161)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:194)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:80)
        at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:144)
        at kafka.server.KafkaApis.handleLeaderAndISRRequest(KafkaApis.scala:73)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:60)
        at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:40)
        at java.lang.Thread.run(Thread.java:662)
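The guard that throws here can be illustrated with a rough Python model. This is hypothetical and simplified — the real logic lives in kafka.log.SegmentList and kafka.log.Log in Scala — but it shows the invariant the exception message states: only the final segment of the list may be truncated away, so the index handed to truncLast must be the last index of the list.

```python
class SegmentList:
    """Toy contiguous segment list; each segment is a (start, end) offset range."""

    def __init__(self, segments):
        self.segments = list(segments)

    def trunc_last(self, end_index):
        # The guard seen in the stack trace above: only the last segment
        # may be removed, so the index passed in must be len - 1.
        if end_index != len(self.segments) - 1:
            raise RuntimeError("End index must be segment list size - 1")
        return self.segments.pop()


def truncate_to(seg_list, target_offset):
    """Drop trailing segments that lie entirely at or beyond target_offset."""
    while seg_list.segments and seg_list.segments[-1][0] >= target_offset:
        seg_list.trunc_last(len(seg_list.segments) - 1)
```

In the failure above, makeFollower truncates the local log while becoming a follower; any code path that computes an end index other than the last one trips this check.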

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

Posted by "John Fung (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Fung updated KAFKA-514:
----------------------------

    Attachment: system_test_output_archive.tar.gz
    
> Replication with Leader Failure Test: Log segment files checksum mismatch
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-514
>                 URL: https://issues.apache.org/jira/browse/KAFKA-514
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: John Fung
>            Priority: Blocker
>              Labels: replication-testing
>             Fix For: 0.8
>
>         Attachments: system_test_output_archive.tar.gz, testcase_2.tar
>


[jira] [Updated] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

Posted by "John Fung (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Fung updated KAFKA-514:
----------------------------

    Attachment: kafka-514-reproduce-issue.patch
    
> Replication with Leader Failure Test: Log segment files checksum mismatch
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-514
>                 URL: https://issues.apache.org/jira/browse/KAFKA-514
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: John Fung
>            Priority: Blocker
>              Labels: replication-testing
>             Fix For: 0.8
>
>         Attachments: kafka-514-reproduce-issue.patch, system_test_output_archive.tar.gz, testcase_2.tar
>


[jira] [Commented] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

Posted by "John Fung (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/KAFKA-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469833#comment-13469833 ] 

John Fung commented on KAFKA-514:
---------------------------------

This issue can be reproduced as follows:

1. Check out the latest 0.8 branch
2. Apply kafka-502-v4.patch
3. Under directory <kafka_home>, execute "./sbt update package" to build Kafka
4. Untar testcase_2.tar to <kafka_home>/system_test/replication_testsuite/
5. Modify <kafka_home>/system_test/testcase_to_run.json from "testcase_1" to "testcase_2"
6. Under directory <kafka_home>/system_test, execute "python -B system_test_runner.py"
                
> Replication with Leader Failure Test: Log segment files checksum mismatch
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-514
>                 URL: https://issues.apache.org/jira/browse/KAFKA-514
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: John Fung
>            Priority: Blocker
>              Labels: replication-testing
>             Fix For: 0.8
>
>         Attachments: system_test_output_archive.tar.gz, testcase_2.tar
>


[jira] [Updated] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

Posted by "John Fung (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Fung updated KAFKA-514:
----------------------------

    Attachment: system_test_output_archive.tar
    
> Replication with Leader Failure Test: Log segment files checksum mismatch
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-514
>                 URL: https://issues.apache.org/jira/browse/KAFKA-514
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: John Fung
>            Priority: Blocker
>              Labels: replication-testing
>             Fix For: 0.8
>
>         Attachments: system_test_output_archive.tar, testcase_2.tar
>


[jira] [Commented] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

Posted by "John Fung (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/KAFKA-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470790#comment-13470790 ] 

John Fung commented on KAFKA-514:
---------------------------------

Thanks Jun for the fix.

* This testcase consistently failed before your fix.
* After applying the fix:
    * the testcase failed twice and passed once (with the full set of mbeans specified in metrics.json)
    * the testcase passed twice in a row (with fewer mbeans specified in metrics.json)
                
> Replication with Leader Failure Test: Log segment files checksum mismatch
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-514
>                 URL: https://issues.apache.org/jira/browse/KAFKA-514
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: John Fung
>            Priority: Blocker
>              Labels: replication-testing
>             Fix For: 0.8
>
>         Attachments: kafka-514-reproduce-issue.patch, kafka-514_v1.patch, system_test_output_archive.tar.gz, testcase_2.tar
>


[jira] [Updated] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

Posted by "Joel Koshy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Koshy updated KAFKA-514:
-----------------------------

    Labels: replication-testing  (was: )
    
> Replication with Leader Failure Test: Log segment files checksum mismatch
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-514
>                 URL: https://issues.apache.org/jira/browse/KAFKA-514
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: John Fung
>            Priority: Blocker
>              Labels: replication-testing
>             Fix For: 0.8
>
>
> Test Description:
>    1. Produce and consume messages to 1 topics and 3 partitions.
>    2. This test sends 10 messages every 2 sec to 3 replicas.
>    3. At the end verifies the log size and contents as well as using a consumer to verify that there is no message loss.
> The issue:
> When the leader is terminated by a controlled failure (kill -15), the resulting log segment files size are not all matching. The mismatch log segment size would happen in one of the partition of the terminated broker. This is consistently reproducible from the system regression test for replication with the following configurations:
>     * zookeeper: 1-node (local)
>     * brokers: 3-node cluster (all local)
>     * replica factor: 3
>     * no. of topic: 1
>     * no. of partition: 2
>     * iterations of leader failure: 1
> Remarks:
>     * It is rarely reproducible if the no. of partitions is 1.
>     * Even the file checksums are not matching, the no. of messages in the producer & consumer logs are equal
> Test result (shown with log file checksum):
> broker-1 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas
> broker-2 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas
> broker-3 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 3530842923    <<<< not matching across all replicas
> Errors:
> The following error is found in the terminated leader:
> [2012-09-14 11:07:05,217] WARN No previously checkpointed highwatermark value found for topic test_1 partition 1. Returning 0 as the highwatermark (kafka.server.HighwaterMarkCheckpoint)
> [2012-09-14 11:07:05,220] ERROR Replica Manager on Broker 3: Error processing leaderAndISR request LeaderAndIsrRequest(1,,true,1000,Map((test_1,1) -> { "ISR": "1,2","leader": "1","leaderEpoch": "0" }, (test_1,0) -> { "ISR": "
> 1,2","leader": "1","leaderEpoch": "1" })) (kafka.server.ReplicaManager)
> kafka.common.KafkaException: End index must be segment list size - 1
>         at kafka.log.SegmentList.truncLast(SegmentList.scala:82)
>         at kafka.log.Log.truncateTo(Log.scala:471)
>         at kafka.cluster.Partition.makeFollower(Partition.scala:171)
>         at kafka.cluster.Partition.makeLeaderOrFollower(Partition.scala:126)
>         at kafka.server.ReplicaManager.kafka$server$ReplicaManager$$makeFollower(ReplicaManager.scala:195)
>         at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:154)
>         at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:144)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:631)
>         at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:161)
>         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:194)
>         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>         at scala.collection.mutable.HashMap.foreach(HashMap.scala:80)
>         at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:144)
>         at kafka.server.KafkaApis.handleLeaderAndISRRequest(KafkaApis.scala:73)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:60)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:40)
>         at java.lang.Thread.run(Thread.java:662)
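The per-file checksums reported above can be gathered with a small script. A minimal sketch, assuming a plain CRC32 over each segment file (the exact checksum the test harness uses may differ) and a hypothetical local data directory layout:

```python
import os
import zlib

def file_crc32(path, chunk_size=1 << 20):
    """Compute the CRC32 of a file, reading it in chunks to bound memory use."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF  # normalize to an unsigned 32-bit value

def checksum_log_dir(log_dir):
    """Map each .kafka segment file under log_dir (e.g. a broker's data
    directory containing test_1-0/, test_1-1/, ...) to its CRC32."""
    sums = {}
    for root, _dirs, files in os.walk(log_dir):
        for name in sorted(files):
            if name.endswith(".kafka"):
                full = os.path.join(root, name)
                sums[os.path.relpath(full, log_dir)] = file_crc32(full)
    return sums
```

Running `checksum_log_dir` against each broker's data directory and diffing the resulting maps would surface mismatches like the ones shown for partition test_1-1.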

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

Posted by "Jun Rao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao updated KAFKA-514:
--------------------------

    Attachment: kafka-514_v1.patch

This seems to be the same problem as kafka-525, which is supposed to be fixed by kafka-42. Adding a temporary patch to fix this specific issue. Could you try it and see whether it fixes the issue?
                
> Replication with Leader Failure Test: Log segment files checksum mismatch
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-514
>                 URL: https://issues.apache.org/jira/browse/KAFKA-514
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: John Fung
>            Priority: Blocker
>              Labels: replication-testing
>             Fix For: 0.8
>
>         Attachments: kafka-514-reproduce-issue.patch, kafka-514_v1.patch, system_test_output_archive.tar.gz, testcase_2.tar
>
>
> Test Description:
>    1. Produce and consume messages to 1 topics and 3 partitions.
>    2. This test sends 10 messages every 2 sec to 3 replicas.
>    3. At the end verifies the log size and contents as well as using a consumer to verify that there is no message loss.
> The issue:
> When the leader is terminated by a controlled failure (kill -15), the resulting log segment files size are not all matching. The mismatch log segment size would happen in one of the partition of the terminated broker. This is consistently reproducible from the system regression test for replication with the following configurations:
>     * zookeeper: 1-node (local)
>     * brokers: 3-node cluster (all local)
>     * replica factor: 3
>     * no. of topic: 1
>     * no. of partition: 2
>     * iterations of leader failure: 1
> Remarks:
>     * It is rarely reproducible if the no. of partitions is 1.
>     * Even the file checksums are not matching, the no. of messages in the producer & consumer logs are equal
> Test result (shown with log file checksum):
> broker-1 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas
> broker-2 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas
> broker-3 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 3530842923    <<<< not matching across all replicas
> Errors:
> The following error is found in the terminated leader:
> [2012-09-14 11:07:05,217] WARN No previously checkpointed highwatermark value found for topic test_1 partition 1. Returning 0 as the highwatermark (kafka.server.HighwaterMarkCheckpoint)
> [2012-09-14 11:07:05,220] ERROR Replica Manager on Broker 3: Error processing leaderAndISR request LeaderAndIsrRequest(1,,true,1000,Map((test_1,1) -> { "ISR": "1,2","leader": "1","leaderEpoch": "0" }, (test_1,0) -> { "ISR": "
> 1,2","leader": "1","leaderEpoch": "1" })) (kafka.server.ReplicaManager)
> kafka.common.KafkaException: End index must be segment list size - 1
>         at kafka.log.SegmentList.truncLast(SegmentList.scala:82)
>         at kafka.log.Log.truncateTo(Log.scala:471)
>         at kafka.cluster.Partition.makeFollower(Partition.scala:171)
>         at kafka.cluster.Partition.makeLeaderOrFollower(Partition.scala:126)
>         at kafka.server.ReplicaManager.kafka$server$ReplicaManager$$makeFollower(ReplicaManager.scala:195)
>         at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:154)
>         at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:144)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:631)
>         at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:161)
>         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:194)
>         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>         at scala.collection.mutable.HashMap.foreach(HashMap.scala:80)
>         at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:144)
>         at kafka.server.KafkaApis.handleLeaderAndISRRequest(KafkaApis.scala:73)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:60)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:40)
>         at java.lang.Thread.run(Thread.java:662)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

Posted by "John Fung (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/KAFKA-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471038#comment-13471038 ] 

John Fung commented on KAFKA-514:
---------------------------------

Thanks Jun for patch v2. The system test is now passing consistently with the original full metrics.json.
                
> Replication with Leader Failure Test: Log segment files checksum mismatch
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-514
>                 URL: https://issues.apache.org/jira/browse/KAFKA-514
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: John Fung
>            Priority: Blocker
>              Labels: replication-testing
>             Fix For: 0.8
>
>         Attachments: kafka-514-reproduce-issue.patch, kafka-514_v1.patch, kafka-514_v2.patch, system_test_output_archive.tar.gz, testcase_2.tar
>
>
> Test Description:
>    1. Produce and consume messages to 1 topics and 3 partitions.
>    2. This test sends 10 messages every 2 sec to 3 replicas.
>    3. At the end verifies the log size and contents as well as using a consumer to verify that there is no message loss.
> The issue:
> When the leader is terminated by a controlled failure (kill -15), the resulting log segment files size are not all matching. The mismatch log segment size would happen in one of the partition of the terminated broker. This is consistently reproducible from the system regression test for replication with the following configurations:
>     * zookeeper: 1-node (local)
>     * brokers: 3-node cluster (all local)
>     * replica factor: 3
>     * no. of topic: 1
>     * no. of partition: 2
>     * iterations of leader failure: 1
> Remarks:
>     * It is rarely reproducible if the no. of partitions is 1.
>     * Even the file checksums are not matching, the no. of messages in the producer & consumer logs are equal
> Test result (shown with log file checksum):
> broker-1 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas
> broker-2 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas
> broker-3 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 3530842923    <<<< not matching across all replicas
> Errors:
> The following error is found in the terminated leader:
> [2012-09-14 11:07:05,217] WARN No previously checkpointed highwatermark value found for topic test_1 partition 1. Returning 0 as the highwatermark (kafka.server.HighwaterMarkCheckpoint)
> [2012-09-14 11:07:05,220] ERROR Replica Manager on Broker 3: Error processing leaderAndISR request LeaderAndIsrRequest(1,,true,1000,Map((test_1,1) -> { "ISR": "1,2","leader": "1","leaderEpoch": "0" }, (test_1,0) -> { "ISR": "
> 1,2","leader": "1","leaderEpoch": "1" })) (kafka.server.ReplicaManager)
> kafka.common.KafkaException: End index must be segment list size - 1
>         at kafka.log.SegmentList.truncLast(SegmentList.scala:82)
>         at kafka.log.Log.truncateTo(Log.scala:471)
>         at kafka.cluster.Partition.makeFollower(Partition.scala:171)
>         at kafka.cluster.Partition.makeLeaderOrFollower(Partition.scala:126)
>         at kafka.server.ReplicaManager.kafka$server$ReplicaManager$$makeFollower(ReplicaManager.scala:195)
>         at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:154)
>         at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:144)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:631)
>         at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:161)
>         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:194)
>         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>         at scala.collection.mutable.HashMap.foreach(HashMap.scala:80)
>         at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:144)
>         at kafka.server.KafkaApis.handleLeaderAndISRRequest(KafkaApis.scala:73)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:60)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:40)
>         at java.lang.Thread.run(Thread.java:662)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

Posted by "Jun Rao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao updated KAFKA-514:
--------------------------

    Attachment: kafka-514_v2.patch

Attaching patch v2 (it includes the v1 changes). This is just a temporary fix for kafka-551. The system test now passes for me. Could you give it a try?
                
> Replication with Leader Failure Test: Log segment files checksum mismatch
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-514
>                 URL: https://issues.apache.org/jira/browse/KAFKA-514
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: John Fung
>            Priority: Blocker
>              Labels: replication-testing
>             Fix For: 0.8
>
>         Attachments: kafka-514-reproduce-issue.patch, kafka-514_v1.patch, kafka-514_v2.patch, system_test_output_archive.tar.gz, testcase_2.tar
>
>
> Test Description:
>    1. Produce and consume messages to 1 topics and 3 partitions.
>    2. This test sends 10 messages every 2 sec to 3 replicas.
>    3. At the end verifies the log size and contents as well as using a consumer to verify that there is no message loss.
> The issue:
> When the leader is terminated by a controlled failure (kill -15), the resulting log segment files size are not all matching. The mismatch log segment size would happen in one of the partition of the terminated broker. This is consistently reproducible from the system regression test for replication with the following configurations:
>     * zookeeper: 1-node (local)
>     * brokers: 3-node cluster (all local)
>     * replica factor: 3
>     * no. of topic: 1
>     * no. of partition: 2
>     * iterations of leader failure: 1
> Remarks:
>     * It is rarely reproducible if the no. of partitions is 1.
>     * Even the file checksums are not matching, the no. of messages in the producer & consumer logs are equal
> Test result (shown with log file checksum):
> broker-1 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas
> broker-2 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas
> broker-3 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 3530842923    <<<< not matching across all replicas
> Errors:
> The following error is found in the terminated leader:
> [2012-09-14 11:07:05,217] WARN No previously checkpointed highwatermark value found for topic test_1 partition 1. Returning 0 as the highwatermark (kafka.server.HighwaterMarkCheckpoint)
> [2012-09-14 11:07:05,220] ERROR Replica Manager on Broker 3: Error processing leaderAndISR request LeaderAndIsrRequest(1,,true,1000,Map((test_1,1) -> { "ISR": "1,2","leader": "1","leaderEpoch": "0" }, (test_1,0) -> { "ISR": "
> 1,2","leader": "1","leaderEpoch": "1" })) (kafka.server.ReplicaManager)
> kafka.common.KafkaException: End index must be segment list size - 1
>         at kafka.log.SegmentList.truncLast(SegmentList.scala:82)
>         at kafka.log.Log.truncateTo(Log.scala:471)
>         at kafka.cluster.Partition.makeFollower(Partition.scala:171)
>         at kafka.cluster.Partition.makeLeaderOrFollower(Partition.scala:126)
>         at kafka.server.ReplicaManager.kafka$server$ReplicaManager$$makeFollower(ReplicaManager.scala:195)
>         at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:154)
>         at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:144)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:631)
>         at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:161)
>         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:194)
>         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>         at scala.collection.mutable.HashMap.foreach(HashMap.scala:80)
>         at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:144)
>         at kafka.server.KafkaApis.handleLeaderAndISRRequest(KafkaApis.scala:73)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:60)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:40)
>         at java.lang.Thread.run(Thread.java:662)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

Posted by "John Fung (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Fung updated KAFKA-514:
----------------------------

    Attachment:     (was: system_test_output_archive.tar)
    

[jira] [Commented] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

Posted by "John Fung (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/KAFKA-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470753#comment-13470753 ] 

John Fung commented on KAFKA-514:
---------------------------------

Uploaded kafka-514-reproduce-issue.patch to reproduce the issue:

1. Download the latest 0.8 branch
2. Apply kafka-514-reproduce-issue.patch
3. Under directory <kafka_home>, execute "./sbt update package" to build Kafka
4. Under directory <kafka_home>/system_test, execute "python -B system_test_runner.py"

                
> Replication with Leader Failure Test: Log segment files checksum mismatch
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-514
>                 URL: https://issues.apache.org/jira/browse/KAFKA-514
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: John Fung
>            Priority: Blocker
>              Labels: replication-testing
>             Fix For: 0.8
>
>         Attachments: kafka-514-reproduce-issue.patch, system_test_output_archive.tar.gz, testcase_2.tar
>
>
> Test Description:
>    1. Produce and consume messages to 1 topics and 3 partitions.
>    2. This test sends 10 messages every 2 sec to 3 replicas.
>    3. At the end verifies the log size and contents as well as using a consumer to verify that there is no message loss.
> The issue:
> When the leader is terminated by a controlled failure (kill -15), the resulting log segment files size are not all matching. The mismatch log segment size would happen in one of the partition of the terminated broker. This is consistently reproducible from the system regression test for replication with the following configurations:
>     * zookeeper: 1-node (local)
>     * brokers: 3-node cluster (all local)
>     * replica factor: 3
>     * no. of topic: 1
>     * no. of partition: 2
>     * iterations of leader failure: 1
> Remarks:
>     * It is rarely reproducible if the no. of partitions is 1.
>     * Even the file checksums are not matching, the no. of messages in the producer & consumer logs are equal
> Test result (shown with log file checksum):
> broker-1 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas
> broker-2 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas
> broker-3 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 3530842923    <<<< not matching across all replicas
> Errors:
> The following error is found in the terminated leader:
> [2012-09-14 11:07:05,217] WARN No previously checkpointed highwatermark value found for topic test_1 partition 1. Returning 0 as the highwatermark (kafka.server.HighwaterMarkCheckpoint)
> [2012-09-14 11:07:05,220] ERROR Replica Manager on Broker 3: Error processing leaderAndISR request LeaderAndIsrRequest(1,,true,1000,Map((test_1,1) -> { "ISR": "1,2","leader": "1","leaderEpoch": "0" }, (test_1,0) -> { "ISR": "
> 1,2","leader": "1","leaderEpoch": "1" })) (kafka.server.ReplicaManager)
> kafka.common.KafkaException: End index must be segment list size - 1
>         at kafka.log.SegmentList.truncLast(SegmentList.scala:82)
>         at kafka.log.Log.truncateTo(Log.scala:471)
>         at kafka.cluster.Partition.makeFollower(Partition.scala:171)
>         at kafka.cluster.Partition.makeLeaderOrFollower(Partition.scala:126)
>         at kafka.server.ReplicaManager.kafka$server$ReplicaManager$$makeFollower(ReplicaManager.scala:195)
>         at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:154)
>         at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:144)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:631)
>         at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:161)
>         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:194)
>         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>         at scala.collection.mutable.HashMap.foreach(HashMap.scala:80)
>         at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:144)
>         at kafka.server.KafkaApis.handleLeaderAndISRRequest(KafkaApis.scala:73)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:60)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:40)
>         at java.lang.Thread.run(Thread.java:662)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

Posted by "Joel Koshy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Koshy updated KAFKA-514:
-----------------------------

             Priority: Blocker  (was: Major)
    Affects Version/s: 0.8
        Fix Version/s: 0.8
    

[jira] [Comment Edited] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

Posted by "John Fung (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/KAFKA-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469833#comment-13469833 ] 

John Fung edited comment on KAFKA-514 at 10/5/12 10:34 AM:
-----------------------------------------------------------

This issue can be reproduced as follows:

1. Download the latest 0.8 branch
2. Apply kafka-502-v4.patch
3. Under directory <kafka_home>, execute "./sbt update package" to build Kafka
4. Untar testcase_2.tar to <kafka_home>/system_test/replication_testsuite/
5. Modify <kafka_home>/system_test/testcase_to_run.json, changing "testcase_1" to "testcase_2"
6. Under directory <kafka_home>/system_test, execute "python -B system_test_runner.py"
7. The main test framework console output, broker logs, and broker data log segment files are tarred in the file system_test_output_archive.tar.
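For reference, per-segment checksums like those reported by the test framework can be recomputed from the archived data directories. A minimal sketch, assuming the reported values are CRC32 (they fit in an unsigned 32-bit range — this may not be exactly what the framework computes) and using a hypothetical data-directory path:

```python
import glob
import os
import zlib

def segment_checksums(data_dir):
    """Map each topic-partition/*.kafka segment file to its CRC32 checksum."""
    checksums = {}
    for path in sorted(glob.glob(os.path.join(data_dir, "*", "*.kafka"))):
        crc = 0
        with open(path, "rb") as f:
            # Stream in 1 MiB chunks so large segment files are not read whole.
            while chunk := f.read(1 << 20):
                crc = zlib.crc32(chunk, crc)
        checksums[os.path.relpath(path, data_dir)] = crc & 0xFFFFFFFF
    return checksums

# e.g. segment_checksums("/tmp/kafka_server_1_logs")   # hypothetical path
```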

In this specific test run, there is a log segment file missing in broker-2:

broker-1 :
test_1-0/00000000000000000000.kafka => 4201569950
test_1-0/00000000000000102510.kafka => 1868104866
test_1-0/00000000000000205020.kafka => 1753379349
test_1-0/00000000000000307530.kafka => 1518305117
test_1-0/00000000000000410040.kafka => 3676899141    <<<< not matching across all replicas

broker-2 :
test_1-0/00000000000000000000.kafka => 4201569950
test_1-0/00000000000000102510.kafka => 1868104866
test_1-0/00000000000000205020.kafka => 1753379349
test_1-0/00000000000000307530.kafka => 1518305117

broker-3 :
test_1-0/00000000000000000000.kafka => 4201569950
test_1-0/00000000000000102510.kafka => 1868104866
test_1-0/00000000000000205020.kafka => 1753379349
test_1-0/00000000000000307530.kafka => 1518305117
test_1-0/00000000000000410040.kafka => 3676899141    <<<< not matching across all replicas
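A replica-by-replica comparison like the listing above can be automated by diffing the per-broker checksum maps. A small sketch (broker names and the input maps are hypothetical; on the data above it would flag the segment absent from broker-2):

```python
def diff_replicas(broker_checksums):
    """broker_checksums: {broker_name: {segment_file: checksum}}.

    Returns the segments whose checksum differs across replicas or is
    missing from some replica (reported as None for that broker).
    """
    all_segments = set()
    for checksums in broker_checksums.values():
        all_segments.update(checksums)
    mismatches = {}
    for seg in sorted(all_segments):
        values = {b: c.get(seg) for b, c in broker_checksums.items()}
        if len(set(values.values())) > 1:   # differing checksums, or a None
            mismatches[seg] = values
    return mismatches
```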

                
      was (Author: jfung):
    This issue can be reproduced as follows:

1. Download the latest 0.8 branch
2. Apply kafka-502-v4.patch
3. Under directory <kafka_home>, execute "./sbt update package" to build Kafka
4. Untar testcase_2.tar to <kafka_home>/system_test/replication_testsuite/
5. Modified <kafka_home>/system_test/testcase_to_run.json from "testcase_1" to "testcase_2"
6. Under directory <kafka_home>/system_test, execute "python -B system_test_runner.py"
                  
> Replication with Leader Failure Test: Log segment files checksum mismatch
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-514
>                 URL: https://issues.apache.org/jira/browse/KAFKA-514
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: John Fung
>            Priority: Blocker
>              Labels: replication-testing
>             Fix For: 0.8
>
>         Attachments: system_test_output_archive.tar.gz, testcase_2.tar
>
>
> Test Description:
>    1. Produce and consume messages to 1 topic and 3 partitions.
>    2. This test sends 10 messages every 2 sec to 3 replicas.
>    3. At the end, verify the log size and contents, and use a consumer to verify that there is no message loss.
> The issue:
> When the leader is terminated by a controlled failure (kill -15), the resulting log segment file sizes do not all match. The mismatched log segment size occurs in one of the partitions of the terminated broker. This is consistently reproducible from the system regression test for replication with the following configurations:
>     * zookeeper: 1-node (local)
>     * brokers: 3-node cluster (all local)
>     * replication factor: 3
>     * no. of topics: 1
>     * no. of partitions: 2
>     * iterations of leader failure: 1
> Remarks:
>     * It is rarely reproducible if the no. of partitions is 1.
>     * Even though the file checksums do not match, the no. of messages in the producer & consumer logs are equal
> Test result (shown with log file checksum):
> broker-1 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas
> broker-2 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas
> broker-3 :
> test_1-0/00000000000000000000.kafka => 1690639555
> test_1-1/00000000000000000000.kafka => 3530842923    <<<< not matching across all replicas
> Errors:
> The following error is found in the terminated leader:
> [2012-09-14 11:07:05,217] WARN No previously checkpointed highwatermark value found for topic test_1 partition 1. Returning 0 as the highwatermark (kafka.server.HighwaterMarkCheckpoint)
> [2012-09-14 11:07:05,220] ERROR Replica Manager on Broker 3: Error processing leaderAndISR request LeaderAndIsrRequest(1,,true,1000,Map((test_1,1) -> { "ISR": "1,2","leader": "1","leaderEpoch": "0" }, (test_1,0) -> { "ISR": "
> 1,2","leader": "1","leaderEpoch": "1" })) (kafka.server.ReplicaManager)
> kafka.common.KafkaException: End index must be segment list size - 1
>         at kafka.log.SegmentList.truncLast(SegmentList.scala:82)
>         at kafka.log.Log.truncateTo(Log.scala:471)
>         at kafka.cluster.Partition.makeFollower(Partition.scala:171)
>         at kafka.cluster.Partition.makeLeaderOrFollower(Partition.scala:126)
>         at kafka.server.ReplicaManager.kafka$server$ReplicaManager$$makeFollower(ReplicaManager.scala:195)
>         at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:154)
>         at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:144)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:631)
>         at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:161)
>         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:194)
>         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>         at scala.collection.mutable.HashMap.foreach(HashMap.scala:80)
>         at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:144)
>         at kafka.server.KafkaApis.handleLeaderAndISRRequest(KafkaApis.scala:73)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:60)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:40)
>         at java.lang.Thread.run(Thread.java:662)


[jira] [Updated] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

Posted by "John Fung (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/KAFKA-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Fung updated KAFKA-514:
----------------------------

    Attachment: testcase_2.tar
    
> Replication with Leader Failure Test: Log segment files checksum mismatch
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-514
>                 URL: https://issues.apache.org/jira/browse/KAFKA-514
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: John Fung
>            Priority: Blocker
>              Labels: replication-testing
>             Fix For: 0.8
>
>         Attachments: system_test_output_archive.tar, testcase_2.tar
>
