You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2020/12/13 12:16:10 UTC

[GitHub] [ozone] adoroszlai opened a new pull request #1697: HDDS-4013. FLAKY-UT: TestWatchForCommit#testWatchForCommitForGroupMismatchException

adoroszlai opened a new pull request #1697:
URL: https://github.com/apache/ozone/pull/1697


   ## What changes were proposed in this pull request?
   
   `testWatchForCommitForGroupMismatchException` may fail with `Group ... not found` in `waitForPipelineClose`.  The method destroys pipelines via regular SCM -> datanodes path, but then also calls `removeGroup` for each member manually.
   
   https://github.com/apache/ozone/blob/f30aba704e4b2929764e04db845d8e0cf54e261b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/ClosePipelineCommandHandler.java#L74
   
   Only one of these can succeed, the group cannot be removed twice.  If test code wins, we see this error in the output:
   
   ```
   2020-12-11 11:59:14,841 [Command processor thread] ERROR commandhandler.ClosePipelineCommandHandler (ClosePipelineCommandHandler.java:handle(78)) - Can't close pipeline PipelineID=82ba59fa-dd35-4d4e-a61e-59b281828b58
   java.io.IOException: 675d35ba-bb57-4ae2-b121-2cd47ec358f4: Group group-59B281828B58 not found.
   	at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.removeGroup(XceiverServerRatis.java:774)
   	at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler.handle(ClosePipelineCommandHandler.java:74)
   ```
   
   But if `ClosePipelineCommandHandler` is first to remove, the test fails:
   
   ```
   testWatchForCommitForGroupMismatchException(org.apache.hadoop.ozone.client.rpc.TestWatchForCommit)  Time elapsed: 39.399 s  <<< ERROR!
   java.io.IOException: 14058016-3fcf-4920-991e-a5361899d5d7: Group group-59B281828B58 not found.
   	at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.removeGroup(XceiverServerRatis.java:774)
   	at org.apache.hadoop.ozone.container.TestHelper.waitForPipelineClose(TestHelper.java:242)
   	at org.apache.hadoop.ozone.client.rpc.TestWatchForCommit.testWatchForCommitForGroupMismatchException(TestWatchForCommit.java:353)
   ```
   
   https://issues.apache.org/jira/browse/HDDS-4013
   
   ## How was this patch tested?
   
   Ran `TestWatchForCommit` 50x - never failed, only timed out once:
   https://github.com/adoroszlai/hadoop-ozone/runs/1544890660
   
   Regular CI:
   https://github.com/adoroszlai/hadoop-ozone/actions/runs/418502626


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai commented on pull request #1697: HDDS-4013. FLAKY-UT: TestWatchForCommit#testWatchForCommitForGroupMismatchException

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on pull request #1697:
URL: https://github.com/apache/ozone/pull/1697#issuecomment-744421182


   Thanks @ayushtkn for the review.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] ayushtkn commented on a change in pull request #1697: HDDS-4013. FLAKY-UT: TestWatchForCommit#testWatchForCommitForGroupMismatchException

Posted by GitBox <gi...@apache.org>.
ayushtkn commented on a change in pull request #1697:
URL: https://github.com/apache/ozone/pull/1697#discussion_r542302227



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/container/TestHelper.java
##########
@@ -234,12 +235,14 @@ public static void waitForPipelineClose(List<Pipeline> pipelineList,
 
     // wait for the pipeline to get destroyed in the datanodes
     for (Pipeline pipeline : pipelineList) {
+      HddsProtos.PipelineID pipelineId = pipeline.getId().getProtobuf();
       for (DatanodeDetails dn : pipeline.getNodes()) {
         XceiverServerSpi server =
             cluster.getHddsDatanodes().get(cluster.getHddsDatanodeIndex(dn))
                 .getDatanodeStateMachine().getContainer().getWriteChannel();
         Assert.assertTrue(server instanceof XceiverServerRatis);
-        server.removeGroup(pipeline.getId().getProtobuf());
+        GenericTestUtils.waitFor(() -> !server.isExist(pipelineId),
+            100, 30_000);

Review comment:
       makes better to stay near the prod code. Thanx




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] ayushtkn commented on a change in pull request #1697: HDDS-4013. FLAKY-UT: TestWatchForCommit#testWatchForCommitForGroupMismatchException

Posted by GitBox <gi...@apache.org>.
ayushtkn commented on a change in pull request #1697:
URL: https://github.com/apache/ozone/pull/1697#discussion_r542110687



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/container/TestHelper.java
##########
@@ -234,12 +235,14 @@ public static void waitForPipelineClose(List<Pipeline> pipelineList,
 
     // wait for the pipeline to get destroyed in the datanodes
     for (Pipeline pipeline : pipelineList) {
+      HddsProtos.PipelineID pipelineId = pipeline.getId().getProtobuf();
       for (DatanodeDetails dn : pipeline.getNodes()) {
         XceiverServerSpi server =
             cluster.getHddsDatanodes().get(cluster.getHddsDatanodeIndex(dn))
                 .getDatanodeStateMachine().getContainer().getWriteChannel();
         Assert.assertTrue(server instanceof XceiverServerRatis);
-        server.removeGroup(pipeline.getId().getProtobuf());
+        GenericTestUtils.waitFor(() -> !server.isExist(pipelineId),
+            100, 30_000);

Review comment:
       Thanx @adoroszlai for the fix, I am able to repro this, changes seems fair enough.
   Does it makes sense to `removeGroup` explicitly in case `server.isExist(pipelineId)` is true, rather than waiting? Might speed up test? We can ignore the exception from `removeGroup` in the wait and retry to counter the race condition




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai merged pull request #1697: HDDS-4013. FLAKY-UT: TestWatchForCommit#testWatchForCommitForGroupMismatchException

Posted by GitBox <gi...@apache.org>.
adoroszlai merged pull request #1697:
URL: https://github.com/apache/ozone/pull/1697


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai commented on a change in pull request #1697: HDDS-4013. FLAKY-UT: TestWatchForCommit#testWatchForCommitForGroupMismatchException

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on a change in pull request #1697:
URL: https://github.com/apache/ozone/pull/1697#discussion_r542267852



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/container/TestHelper.java
##########
@@ -234,12 +235,14 @@ public static void waitForPipelineClose(List<Pipeline> pipelineList,
 
     // wait for the pipeline to get destroyed in the datanodes
     for (Pipeline pipeline : pipelineList) {
+      HddsProtos.PipelineID pipelineId = pipeline.getId().getProtobuf();
       for (DatanodeDetails dn : pipeline.getNodes()) {
         XceiverServerSpi server =
             cluster.getHddsDatanodes().get(cluster.getHddsDatanodeIndex(dn))
                 .getDatanodeStateMachine().getContainer().getWriteChannel();
         Assert.assertTrue(server instanceof XceiverServerRatis);
-        server.removeGroup(pipeline.getId().getProtobuf());
+        GenericTestUtils.waitFor(() -> !server.isExist(pipelineId),
+            100, 30_000);

Review comment:
       Thanks @ayushtkn for taking a look.
   
   I'd rather avoid such "manual intervention" and let the test exercise production code paths.
   
   I don't think keeping `removeGroup` would result in much speedup.  If SCM and datanodes take too much time to execute regular flow, we could probably tweak executor intervals, etc.
   
   It might be possible to convert it to a unit test with mocks, resulting in much bigger speedup by avoiding the mini cluster.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org