You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2021/04/03 11:13:39 UTC
[GitHub] [kafka] vamossagar12 commented on a change in pull request #10468: Kafka 12373

vamossagar12 commented on a change in pull request #10468:
URL: https://github.com/apache/kafka/pull/10468#discussion_r606652084



##########
File path: raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientTest.java
##########
@@ -1673,6 +1673,69 @@ public void testLeaderGracefulShutdownTimeout() throws Exception {
         assertFutureThrows(shutdownFuture, TimeoutException.class);
     }
 
+    @Test
+    public void testLeaderGracefulShutdownOnClose() throws Exception {
+        int localId = 0;
+        int otherNodeId = 1;
+        int lingerMs = 50;
+        Set<Integer> voters = Utils.mkSet(localId, otherNodeId);
+
+        RaftClientTestContext context = new RaftClientTestContext.Builder(localId, voters)
+            .withAppendLingerMs(lingerMs)
+            .build();
+
+        context.becomeLeader();
+        assertEquals(OptionalInt.of(localId), context.currentLeader());
+        assertEquals(1L, context.log.endOffset().offset);
+
+        int epoch = context.currentEpoch();
+        assertEquals(1L, context.client.scheduleAppend(epoch, singletonList("a")));
+
+        context.client.poll();
+        assertEquals(OptionalLong.of(lingerMs), context.messageQueue.lastPollTimeoutMs());
+
+        context.time.sleep(20);
+
+        // client closed now.
+        context.client.close();
+
+        // Flag for accepting appends should be toggled to false.
+        assertFalse(context.client.canAcceptAppends());
+
+        // acceptAppends flag set to false so no writes should be accepted by the Leader now.
+        assertNull(context.client.scheduleAppend(epoch, singletonList("b")));
+
+        // The leader should trigger a flush for whatever batches are present in the BatchAccumulator
+        assertEquals(2L, context.log.endOffset().offset);
+
+        // Now shutdown
+
+        // We should still be running until we have had a chance to send EndQuorumEpoch
+        assertTrue(context.client.isShuttingDown());
+        assertTrue(context.client.isRunning());
+
+        // Send EndQuorumEpoch request to the other voter
+        context.pollUntilRequest();
+        assertTrue(context.client.isShuttingDown());
+        assertTrue(context.client.isRunning());
+        context.assertSentEndQuorumEpochRequest(1, otherNodeId);
+
+        // We should still be able to handle vote requests during graceful shutdown
+        // in order to help the new leader get elected
+        context.deliverRequest(context.voteRequest(epoch + 1, otherNodeId, epoch, 1L));
+        context.client.poll();
+        context.assertSentVoteResponse(Errors.NONE, epoch + 1, OptionalInt.empty(), true);
+

Review comment:
       @jsancio I needed some pointers here- maybe I am doing something wrong. If you check the close method, I am ensuring that any records in the leader's log are flushed. Then I am invoking the graceful shutdown through the leader. I was checking the other test cases, and one of the things I found was that the leader could still participate in the VoteRequest while it's being gracefully shutdown. Couple of questions:
   
   1) The voteGranted returns as false here as, the quorum goes into an unattached state and this condition fails:
   
   ```
   OffsetAndEpoch lastEpochEndOffsetAndEpoch = new OffsetAndEpoch(lastEpochEndOffset, lastEpoch);
   voteGranted = lastEpochEndOffsetAndEpoch.compareTo(endOffset()) >= 0;
   ```
   
   The reason is that the LEO of the being closed leader is still higher than that of the candidate which just sent a Vote request. So, at this point, it may never receive the grant for the vote. When I remove the logic to flush batches from close(), this condition passes. 
   
   So, should the leader even be able to handle vote requests at this point? And if the leader is going away anyways and let's say if it doesn't come back, then would flushing it's logs help? The issue could be solved by the 3rd point in the ticket i.e 
   
   > Wait with a timeout for the high-watermark to reach the LEO
   
   But again, for this the leader needs to be able to handle fetch requests at this point, right? My doubt here is, if it doesn't receive the fetch requests OR if the timeout you mentioned above is hit, then we will end up with the same issue that I described above.
   
   Plz let me know if there's a flaw in my understanding here.. Thanks!
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org