You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "C0urante (via GitHub)" <gi...@apache.org> on 2023/06/03 18:07:31 UTC

[GitHub] [kafka] C0urante opened a new pull request, #13806: MINOR: Fix flaky DistributedHerderTest cases related to zombie fencing

C0urante opened a new pull request, #13806:
URL: https://github.com/apache/kafka/pull/13806

   I frequently see failures when running these test cases locally caused by fewer-than-expected (2 instead of 3) invocations of `member::wakeup`.
   
   This turns out to be due to a benign race condition in `DistributedHerder::addRequest`, which is documented in a comment in that method. A comment is also added to `DistributedHerderTest::testTaskRequestedZombieFencingForwardingToLeader` explaining why we relax our expectations about how many invocations of `member::wakeup` take place.
   
   As an alternative, we could consider adding synchronization around access to the herder request queue. But the additional implementation complexity and risks of deadlock don't seem worth the finer-grained testing logic we'd be able to write.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] yashmayya commented on a diff in pull request #13806: MINOR: Fix flaky DistributedHerderTest cases related to zombie fencing

Posted by "yashmayya (via GitHub)" <gi...@apache.org>.
yashmayya commented on code in PR #13806:
URL: https://github.com/apache/kafka/pull/13806#discussion_r1216948753


##########
connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java:
##########
@@ -2212,6 +2212,13 @@ DistributedHerderRequest addRequest(Callable<Void> action, Callback<Void> callba
     DistributedHerderRequest addRequest(long delayMs, Callable<Void> action, Callback<Void> callback) {
         DistributedHerderRequest req = new DistributedHerderRequest(time.milliseconds() + delayMs, requestSeqNum.incrementAndGet(), action, callback);
         requests.add(req);
+        // We don't need to synchronize here
+        // If the condition evaluates to true, we can and should trigger a wakeup
+        // If it evaluates to false because our request has suddenly been popped off of the queue, then
+        //     the herder is already running the request and no wakeup is necessary

Review Comment:
   nit: unnecessary indentation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] C0urante commented on a diff in pull request #13806: MINOR: Fix flaky DistributedHerderTest cases related to zombie fencing

Posted by "C0urante (via GitHub)" <gi...@apache.org>.
C0urante commented on code in PR #13806:
URL: https://github.com/apache/kafka/pull/13806#discussion_r1218133111


##########
connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java:
##########
@@ -2212,6 +2212,13 @@ DistributedHerderRequest addRequest(Callable<Void> action, Callback<Void> callba
     DistributedHerderRequest addRequest(long delayMs, Callable<Void> action, Callback<Void> callback) {
         DistributedHerderRequest req = new DistributedHerderRequest(time.milliseconds() + delayMs, requestSeqNum.incrementAndGet(), action, callback);
         requests.add(req);
+        // We don't need to synchronize here
+        // If the condition evaluates to true, we can and should trigger a wakeup
+        // If it evaluates to false because our request has suddenly been popped off of the queue, then
+        //     the herder is already running the request and no wakeup is necessary

Review Comment:
   Was hoping to make the separation between the different conditions clearer (sort of but not quite like how we indent if/else chains in Java). Guess that doesn't read well; I've removed the extra indentation here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] C0urante commented on pull request #13806: MINOR: Fix flaky DistributedHerderTest cases related to zombie fencing

Posted by "C0urante (via GitHub)" <gi...@apache.org>.
C0urante commented on PR #13806:
URL: https://github.com/apache/kafka/pull/13806#issuecomment-1575103799

   @yashmayya @gharris1727 would either of you have a moment to take a look at this? Also curious if you've encountered these test failures as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] C0urante commented on pull request #13806: MINOR: Fix flaky DistributedHerderTest cases related to zombie fencing

Posted by "C0urante (via GitHub)" <gi...@apache.org>.
C0urante commented on PR #13806:
URL: https://github.com/apache/kafka/pull/13806#issuecomment-1576887637

   @mimaison @showuon would appreciate a review if you have time, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] C0urante commented on pull request #13806: MINOR: Fix flaky DistributedHerderTest cases related to zombie fencing

Posted by "C0urante (via GitHub)" <gi...@apache.org>.
C0urante commented on PR #13806:
URL: https://github.com/apache/kafka/pull/13806#issuecomment-1577380453

   Test failures appear unrelated. Merging...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] C0urante merged pull request #13806: MINOR: Fix flaky DistributedHerderTest cases related to zombie fencing

Posted by "C0urante (via GitHub)" <gi...@apache.org>.
C0urante merged PR #13806:
URL: https://github.com/apache/kafka/pull/13806


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] C0urante commented on pull request #13806: MINOR: Fix flaky DistributedHerderTest cases related to zombie fencing

Posted by "C0urante (via GitHub)" <gi...@apache.org>.
C0urante commented on PR #13806:
URL: https://github.com/apache/kafka/pull/13806#issuecomment-1577037550

   Thanks Mickael! 🚀 
   
   Will merge pending CI build.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org