Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/08/17 23:03:23 UTC

[GitHub] [ozone] errose28 opened a new pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

errose28 opened a new pull request #2549:
URL: https://github.com/apache/ozone/pull/2549


   ## What changes were proposed in this pull request?
   
   Fix intermittent test failure in TestPipelineClose#testPipelineCloseWithLogFailure. The mechanics of the test look good to me. Based on my testing, the issue was an overly aggressive timeout while waiting for a Mockito invocation.
   
   ## What is the link to the Apache JIRA
   
   HDDS-5604
   
   ## How was this patch tested?
   
   Before the patch, the test failed after 30 runs locally. After the patch, the test passed with 100 runs on CI: https://github.com/errose28/hadoop-ozone/runs/3354906907
       - The test was run on a different commit by accident, but the changes are identical. Check the small diff for that commit if unsure.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] errose28 commented on pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
errose28 commented on pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#issuecomment-902128424


   Thanks to everyone who has helped investigate this. Here is my current understanding of how this test is supposed to work:
   
   1. Pipeline created
   2. Log failure triggered on datanode
   3. Datanode is expected to notify SCM of pipeline close without waiting for the heartbeat interval.
   4. Test checks SCM event queue to see if the mock pipeline action handler it registered was called within the timeout.
   
   So although setting the heartbeat interval on the mini Ozone cluster conf in the wrong order (after the cluster is built) is bad, it should have no bearing on this test. I verified this by setting the heartbeat interval to 10 seconds: the test behaves the same.
   
   By turning on trace logging for the EventQueue, we can see that even in the failure cases both the original PipelineActionHandler and the mocked PipelineActionHandler are triggered on SCM pretty quickly after the DN log fails. This gives me reasonable confidence that the code is working as expected and the error was in the test. So since the test seems to pass repeatedly with the new timeout, I am ok with the fix.
   
   However, I do not fully understand why we had to increase the timeout to 1 second. The SCM event queue fires asynchronously, and I verified that the heartbeat interval is not blocking the datanode. I guess it is possible that DN execution time + loopback RPC latency + SCM execution time sometimes adds up to 1 second, although this seems kind of high since it previously finished under 100ms ~29/30 times, and under 500ms ~139/140 times.
   
   TLDR I am reasonably confident the code is working correctly and this test fix is ok, but why the test fix was needed in the first place is still a bit of a mystery to me.
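
Step 4 above amounts to a bounded wait for an asynchronous callback, so whether it passes depends only on how the timeout compares with the end-to-end latency of one run. A minimal JDK-only sketch of that idea follows. It is an analogue, not the Ozone test and not Mockito; `invokedWithin` and the simulated delays are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TimeoutWaitSketch {

  /**
   * Schedules a stand-in "handler" to fire after handlerDelayMs and reports
   * whether it fired before timeoutMs elapsed. (Hypothetical helper.)
   */
  static boolean invokedWithin(long handlerDelayMs, long timeoutMs) {
    CountDownLatch invoked = new CountDownLatch(1);
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    try {
      // The handler runs on another thread, like an event-queue callback.
      Runnable handler = invoked::countDown;
      scheduler.schedule(handler, handlerDelayMs, TimeUnit.MILLISECONDS);
      // Bounded wait: true only if the handler ran within the timeout.
      return invoked.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      scheduler.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("10 ms handler, 1000 ms timeout: "
        + invokedWithin(10, 1000));
    System.out.println("2000 ms handler, 100 ms timeout: "
        + invokedWithin(2000, 100));
  }
}
```

When the delay is below the timeout the wait succeeds. When it is above, the same code reports a failure even though the handler would eventually have fired, which is the failure mode discussed in this comment.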




[GitHub] [ozone] iamabug commented on a change in pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
iamabug commented on a change in pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#discussion_r691858687



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/pipeline/TestPipelineClose.java
##########
@@ -242,7 +238,7 @@ public void testPipelineCloseWithLogFailure() throws IOException {
     xceiverRatis.handleNodeLogFailure(groupId, null);
 
     // verify SCM receives a pipeline action report "immediately"
-    Mockito.verify(pipelineActionTest, Mockito.timeout(100))
+    Mockito.verify(pipelineActionTest, Mockito.timeout(500))

Review comment:
       I notice that there is a global Timeout variable in this class, which applies to all the methods in this class.
   ```
    public class TestPipelineClose {
   
      /**
        * Set a timeout for each test.
        */
      @Rule
      public Timeout timeout = Timeout.seconds(300);
   ```
   
   Not sure if this has something to do with the failures.






[GitHub] [ozone] github-actions[bot] commented on pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#issuecomment-1055996775


   Thank you very much for the patch. I am closing this PR __temporarily__ as there was no activity recently and it is waiting for response from its author.
   
   It doesn't mean that this PR is not important or ignored: feel free to reopen the PR at any time.
   
   It only means that attention of committers is not required. We prefer to keep the review queue clean. This ensures PRs in need of review are more visible, which results in faster feedback for all PRs.
   
   If you need ANY help to finish this PR, please [contact the community](https://github.com/apache/hadoop-ozone#contact) on the mailing list or the Slack channel.




[GitHub] [ozone] adoroszlai commented on pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#issuecomment-902096192


   > 3 jobs with 10 runs each?
   
   3-4 runs * 10 iterations * `@RepeatedTest(10)`




[GitHub] [ozone] sodonnel commented on a change in pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#discussion_r692152532



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/pipeline/TestPipelineClose.java
##########
@@ -242,7 +238,7 @@ public void testPipelineCloseWithLogFailure() throws IOException {
     xceiverRatis.handleNodeLogFailure(groupId, null);
 
     // verify SCM receives a pipeline action report "immediately"
-    Mockito.verify(pipelineActionTest, Mockito.timeout(100))
+    Mockito.verify(pipelineActionTest, Mockito.timeout(500))

Review comment:
       Sounds like a timeout of 1000ms is the way to go then. 
   
   I suspect the config should really all be set up before the mini-cluster is started, to be sure the components get the correct config in time. Otherwise, a service could start before the config is changed and run with the wrong value.






[GitHub] [ozone] JacksonYao287 edited a comment on pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
JacksonYao287 edited a comment on pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#issuecomment-900819682


   > @JacksonYao287 thanks for investigating. Are you using `org.junit.jupiter.api.RepeatedTest` annotation to run multiple times? When I try using this annotation, it does not initialize the test properly. I have been doing my repeated runs using Intellij test configuration locally, or with maven on CI. I will run another batch locally overnight and see if I can get any failures.
   
   I use the following annotations and run the test using IntelliJ:
   ```
   +import org.junit.jupiter.api.AfterEach;
   +import org.junit.jupiter.api.BeforeEach;
   +import org.junit.jupiter.api.RepeatedTest;
   ```
   
   > Did you see the failure in the same place as reported originally? What timeout value did you use for your testing?
   
   Yes, I see the same failure. I set the timeout to 1000.
   
   > EDIT: I was able to repro a failure after 138 runs (took 1.5 hours). Better than before, but not totally fixed, I agree. I'll move this to draft for now.
   
   thanks for the work!
   
   





[GitHub] [ozone] errose28 commented on pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
errose28 commented on pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#issuecomment-902095233


   @adoroszlai how many times did you run the tests? I see you commented 300-400 repetitions, but looking at the links I only see 3 jobs with 10 runs each. Maybe I misread the results. @JacksonYao287 said he was able to reproduce the failure even with a 1 second timeout. I have not yet tried this specific value myself, but based on my testing it looks like we need at least 200 clean runs to be sure.
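
As a rough way to size the number of clean runs, the sketch below estimates how likely a still-flaky test is to pass a whole batch. It assumes independent runs and a fixed per-run failure rate of about 1 in 140, roughly what was reported in this thread for the 500 ms timeout. Both are simplifying assumptions, and the class and method names are hypothetical.

```java
public class CleanRunOdds {

  /** Probability that all runs pass, given a fixed per-run failure rate. */
  static double probabilityAllPass(double failureRate, int runs) {
    return Math.pow(1.0 - failureRate, runs);
  }

  /**
   * Smallest number of consecutive clean runs after which a test with the
   * given failure rate would have passed them all with probability at most
   * maxChance.
   */
  static int runsNeeded(double failureRate, double maxChance) {
    return (int) Math.ceil(Math.log(maxChance) / Math.log(1.0 - failureRate));
  }

  public static void main(String[] args) {
    double rate = 1.0 / 140; // assumed per-run failure rate
    System.out.printf("P(200 clean runs while still flaky) = %.3f%n",
        probabilityAllPass(rate, 200));
    System.out.println("Clean runs needed for a 5% residual chance: "
        + runsNeeded(rate, 0.05));
  }
}
```

Under these assumptions a test that still fails once in 140 runs has a sizeable chance of passing 200 runs in a row, so a batch of that size lowers the odds of a remaining flake without ruling it out.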




[GitHub] [ozone] errose28 commented on pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
errose28 commented on pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#issuecomment-900767249


   @JacksonYao287 thanks for investigating. Are you using `org.junit.jupiter.api.RepeatedTest` annotation to run multiple times? When I try using this annotation, it does not initialize the test properly. I have been doing my repeated runs using Intellij test configuration locally, or with maven on CI. I will run another batch locally overnight and see if I can get any failures.
   
   Did you see the failure in the same place as reported originally? What timeout value did you use for your testing?




[GitHub] [ozone] errose28 edited a comment on pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
errose28 edited a comment on pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#issuecomment-900767249


   @JacksonYao287 thanks for investigating. Are you using `org.junit.jupiter.api.RepeatedTest` annotation to run multiple times? When I try using this annotation, it does not initialize the test properly. I have been doing my repeated runs using Intellij test configuration locally, or with maven on CI. I will run another batch locally overnight and see if I can get any failures.
   
   Did you see the failure in the same place as reported originally? What timeout value did you use for your testing?
   
   EDIT: I was able to repro a failure after 138 runs (took 1.5 hours). Better than before, but not totally fixed, I agree. I'll move this to draft for now.




[GitHub] [ozone] github-actions[bot] closed pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #2549:
URL: https://github.com/apache/ozone/pull/2549


   




[GitHub] [ozone] adoroszlai commented on a change in pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on a change in pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#discussion_r692145103



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/pipeline/TestPipelineClose.java
##########
@@ -242,7 +238,7 @@ public void testPipelineCloseWithLogFailure() throws IOException {
     xceiverRatis.handleNodeLogFailure(groupId, null);
 
     // verify SCM receives a pipeline action report "immediately"
-    Mockito.verify(pipelineActionTest, Mockito.timeout(100))
+    Mockito.verify(pipelineActionTest, Mockito.timeout(500))

Review comment:
       > Should the conf not be set before the MiniOzoneCluster.build() method is called?
   
   With a 1s timeout it seems to work fine both with and without that change (tested with 300-400 repetitions).
   
   https://github.com/adoroszlai/hadoop-ozone/commits/HDDS-5604-repeat (`build()` moved after conf is set)
   https://github.com/adoroszlai/hadoop-ozone/commits/HDDS-5604-repro-1000 (`build()` in original order)






[GitHub] [ozone] sodonnel commented on a change in pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#discussion_r691050519



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/pipeline/TestPipelineClose.java
##########
@@ -242,7 +238,7 @@ public void testPipelineCloseWithLogFailure() throws IOException {
     xceiverRatis.handleNodeLogFailure(groupId, null);
 
     // verify SCM receives a pipeline action report "immediately"
-    Mockito.verify(pipelineActionTest, Mockito.timeout(100))
+    Mockito.verify(pipelineActionTest, Mockito.timeout(500))

Review comment:
       `SCMDatanodeHeartbeatDispatcher.dispatch()` seems to deal with the heartbeats, and within it there is logic for PIPELINE_ACTIONS commands. If the commands are all sent via the heartbeat, perhaps we need a timeout of 1500. I am not sure, though, whether something on the DN can trigger the heartbeat to be sent immediately if certain commands are queued on the DN.
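
If pipeline actions only travelled with the periodic heartbeat, the extra wait would be bounded by the heartbeat interval, which is where a figure such as 1500 ms for a 1000 ms interval comes from. The sketch below is a hypothetical model of that arithmetic, assuming heartbeats fire at fixed multiples of the interval. It does not reflect how the datanode actually schedules heartbeats.

```java
public class HeartbeatDelaySketch {

  /**
   * Time an action queued at eventTimeMs waits for the next periodic
   * heartbeat, assuming beats at 0, intervalMs, 2 * intervalMs, ...
   */
  static long waitForNextHeartbeat(long eventTimeMs, long intervalMs) {
    long sinceLastBeat = eventTimeMs % intervalMs;
    return sinceLastBeat == 0 ? 0 : intervalMs - sinceLastBeat;
  }

  public static void main(String[] args) {
    long intervalMs = 1000;
    // Queued just after a beat: waits almost a full interval.
    System.out.println(waitForNextHeartbeat(1, intervalMs));
    // Queued just before a beat: waits almost nothing.
    System.out.println(waitForNextHeartbeat(999, intervalMs));
  }
}
```

In this model the wait ranges from 0 up to just under one interval, so a verification timeout shorter than the interval would fail whenever the action happened to be queued early in the cycle.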






[GitHub] [ozone] sodonnel commented on a change in pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#discussion_r691045990



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/pipeline/TestPipelineClose.java
##########
@@ -242,7 +238,7 @@ public void testPipelineCloseWithLogFailure() throws IOException {
     xceiverRatis.handleNodeLogFailure(groupId, null);
 
     // verify SCM receives a pipeline action report "immediately"
-    Mockito.verify(pipelineActionTest, Mockito.timeout(100))
+    Mockito.verify(pipelineActionTest, Mockito.timeout(500))

Review comment:
       If it is still failing, I wonder does the timeout need to be even higher. In the init() method of this test class, I see:
   
   ```
     public void init() throws Exception {
       conf = new OzoneConfiguration();
       cluster = MiniOzoneCluster.newBuilder(conf).setNumDatanodes(3).build();
       conf.setTimeDuration(HddsConfigKeys.HDDS_HEARTBEAT_INTERVAL, 1000,
           TimeUnit.MILLISECONDS);
       pipelineDestroyTimeoutInMillis = 1000;
       conf.setTimeDuration(ScmConfigKeys.OZONE_SCM_PIPELINE_DESTROY_TIMEOUT,
           pipelineDestroyTimeoutInMillis, TimeUnit.MILLISECONDS);
      ...
   ```
   
   Should the conf not be set before the MiniOzoneCluster.build() method is called? Could the cluster potentially start without the conf changes?
   
   
   I cannot remember how this works. The DNs need to send a command to SCM that the pipeline should be closed here. Does the DN do that over the DN heartbeat, or via some other mechanism? If the heartbeat is set to 1 second, could this message take just over a second to arrive in the worst case?
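
The ordering concern can be illustrated with a small stand-alone sketch. It assumes a component that reads its setting once at construction time; whether the mini-cluster services behave this way is exactly the open question here. `Service`, the key `heartbeat.interval.ms` and the 30000 ms default are hypothetical stand-ins, not Ozone classes or configuration keys.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigOrderSketch {

  /** Hypothetical service that reads its heartbeat interval once, at start. */
  static final class Service {
    private final long heartbeatMs;

    Service(Map<String, Long> conf) {
      // Value is captured here; later changes to conf are not seen.
      this.heartbeatMs = conf.getOrDefault("heartbeat.interval.ms", 30_000L);
    }

    long heartbeatMs() {
      return heartbeatMs;
    }
  }

  /** Start first, configure afterwards (the order in the quoted init()). */
  static long buildThenConfigure() {
    Map<String, Long> conf = new HashMap<>();
    Service service = new Service(conf);
    conf.put("heartbeat.interval.ms", 1_000L);
    return service.heartbeatMs();
  }

  /** Configure first, then start. */
  static long configureThenBuild() {
    Map<String, Long> conf = new HashMap<>();
    conf.put("heartbeat.interval.ms", 1_000L);
    Service service = new Service(conf);
    return service.heartbeatMs();
  }

  public static void main(String[] args) {
    System.out.println("build then configure: " + buildThenConfigure());
    System.out.println("configure then build: " + configureThenBuild());
  }
}
```

In this sketch the first ordering leaves the service on the default interval while the second picks up the 1 second value, which is why setting every value before `build()` is the safer order.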



[GitHub] [ozone] errose28 commented on pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
errose28 commented on pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#issuecomment-902100715


   Ah I missed the repeated test added in the diff. Thanks.






[GitHub] [ozone] adoroszlai commented on a change in pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on a change in pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#discussion_r692207106



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/pipeline/TestPipelineClose.java
##########
@@ -242,7 +238,7 @@ public void testPipelineCloseWithLogFailure() throws IOException {
     xceiverRatis.handleNodeLogFailure(groupId, null);
 
     // verify SCM receives a pipeline action report "immediately"
-    Mockito.verify(pipelineActionTest, Mockito.timeout(100))
+    Mockito.verify(pipelineActionTest, Mockito.timeout(500))

Review comment:
       I've updated the PR with both of these changes (setting conf and increasing timeout).






[GitHub] [ozone] adoroszlai commented on a change in pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on a change in pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#discussion_r692185997



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/pipeline/TestPipelineClose.java
##########
@@ -242,7 +238,7 @@ public void testPipelineCloseWithLogFailure() throws IOException {
     xceiverRatis.handleNodeLogFailure(groupId, null);
 
     // verify SCM receives a pipeline action report "immediately"
-    Mockito.verify(pipelineActionTest, Mockito.timeout(100))
+    Mockito.verify(pipelineActionTest, Mockito.timeout(500))

Review comment:
       > I suspect the config should really all be setup before the mini-cluster is started
   
   Agree.








[GitHub] [ozone] sodonnel commented on a change in pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#discussion_r691045990



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/pipeline/TestPipelineClose.java
##########
@@ -242,7 +238,7 @@ public void testPipelineCloseWithLogFailure() throws IOException {
     xceiverRatis.handleNodeLogFailure(groupId, null);
 
     // verify SCM receives a pipeline action report "immediately"
-    Mockito.verify(pipelineActionTest, Mockito.timeout(100))
+    Mockito.verify(pipelineActionTest, Mockito.timeout(500))

Review comment:
       If it is still failing, I wonder whether the timeout needs to be even higher. In the init() method of this test class, I see:
   
   ```
     public void init() throws Exception {
       conf = new OzoneConfiguration();
       cluster = MiniOzoneCluster.newBuilder(conf).setNumDatanodes(3).build();
       conf.setTimeDuration(HddsConfigKeys.HDDS_HEARTBEAT_INTERVAL, 1000,
           TimeUnit.MILLISECONDS);
       pipelineDestroyTimeoutInMillis = 1000;
       conf.setTimeDuration(ScmConfigKeys.OZONE_SCM_PIPELINE_DESTROY_TIMEOUT,
           pipelineDestroyTimeoutInMillis, TimeUnit.MILLISECONDS);
      ...
   ```
   
   Should the conf not be set before the MiniOzoneCluster.build() method is called? Could the cluster potentially start without the conf changes?
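   
   To illustrate the ordering concern, here is a minimal self-contained sketch. `SketchConf`, `SketchCluster`, the key name, and the 30-second default are hypothetical stand-ins, not Ozone classes or defaults. The sketch assumes the cluster reads its settings once at build time, which is exactly what would need checking for MiniOzoneCluster.
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   // Hypothetical stand-in for a configuration object; not OzoneConfiguration.
   class SketchConf {
     private final Map<String, Long> values = new HashMap<>();
   
     void setMillis(String key, long ms) {
       values.put(key, ms);
     }
   
     long getMillis(String key, long defaultMs) {
       return values.getOrDefault(key, defaultMs);
     }
   }
   
   // Hypothetical stand-in for a cluster; not MiniOzoneCluster.
   class SketchCluster {
     final long heartbeatIntervalMs;
   
     // Assumption: settings are read once, when the cluster is built.
     SketchCluster(SketchConf conf) {
       // The key name and the 30_000 ms default are made up for this sketch.
       this.heartbeatIntervalMs = conf.getMillis("heartbeat.interval", 30_000L);
     }
   }
   
   class ConfOrderingSketch {
     // Mirrors the order in init(): build first, then change the conf.
     static long buildThenSet() {
       SketchConf conf = new SketchConf();
       SketchCluster cluster = new SketchCluster(conf);
       conf.setMillis("heartbeat.interval", 1000L); // too late to be seen
       return cluster.heartbeatIntervalMs;
     }
   
     // The order suggested in the review: change the conf, then build.
     static long setThenBuild() {
       SketchConf conf = new SketchConf();
       conf.setMillis("heartbeat.interval", 1000L);
       return new SketchCluster(conf).heartbeatIntervalMs;
     }
   }
   ```
   
   If MiniOzoneCluster behaves like this stand-in, moving the `conf.setTimeDuration(...)` calls above `build()` would make the test settings take effect. If it reads the conf lazily, the order may not matter.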
   
   
   I cannot remember how this works. The DNs need to send a command to SCM that the pipeline should be closed here - does the DN do that over the DN heartbeat, or via some other mechanism? If the heartbeat is set to 1 second, then could this message take just over a second to arrive in the worst case?
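   
   To make the worst case concrete, here is a rough sketch. It assumes the pipeline action only leaves the DN with the next heartbeat and that heartbeats fire at fixed multiples of the interval. Both are assumptions, as are the class and method names, and network and processing time are ignored.
   
   ```java
   // Hypothetical helper for reasoning about heartbeat-driven delays; not Ozone code.
   class HeartbeatDelaySketch {
   
     // Milliseconds an action queued at queuedAtMs waits for the next heartbeat,
     // assuming heartbeats fire at 0, intervalMs, 2 * intervalMs, and so on.
     static long delayUntilNextHeartbeatMs(long queuedAtMs, long intervalMs) {
       if (queuedAtMs < 0 || intervalMs <= 0) {
         throw new IllegalArgumentException("times must be non-negative, interval positive");
       }
       long sinceLastTick = queuedAtMs % intervalMs;
       // An action queued exactly on a tick is assumed to make that heartbeat.
       return sinceLastTick == 0 ? 0 : intervalMs - sinceLastTick;
     }
   
     // True if the action would be sent before a verify timeout expires.
     static boolean sentWithin(long queuedAtMs, long intervalMs, long timeoutMs) {
       return delayUntilNextHeartbeatMs(queuedAtMs, intervalMs) <= timeoutMs;
     }
   }
   ```
   
   Under these assumptions, an action queued 1 ms after a heartbeat waits 999 ms, so neither a 100 ms nor a 500 ms verify timeout would be reliable with a 1000 ms interval.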






[GitHub] [ozone] avijayanhwx commented on pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
avijayanhwx commented on pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#issuecomment-915323266


   I see that a lot of analysis has gone into this already. I am wondering if we can change the test a little bit to help remove the rare failure. Since this is a mocked-up test (explicitly adding a pipeline actions handler, and waiting for capture), maybe we can take the RPC call out of the test. If we can merely capture the heartbeat on the DN side to make sure it has the pipeline action, that may be good enough. Thoughts?
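   
   As a rough sketch of that idea, the test could assert directly on what the DN would put into its next heartbeat, with no RPC involved. `SketchDatanodeContext` and its methods are hypothetical stand-ins for illustration, not Ozone's actual datanode classes.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   // Hypothetical stand-in for DN-side state; not an Ozone class.
   class SketchDatanodeContext {
     private final List<String> pendingPipelineActions = new ArrayList<>();
   
     // Would be called when a log failure is handled for a pipeline.
     synchronized void addPipelineCloseAction(String pipelineId) {
       pendingPipelineActions.add("CLOSE:" + pipelineId);
     }
   
     // Returns the actions the next heartbeat would carry and clears the queue.
     synchronized List<String> drainActionsForHeartbeat() {
       List<String> actions = new ArrayList<>(pendingPipelineActions);
       pendingPipelineActions.clear();
       return actions;
     }
   }
   ```
   
   A test built this way checks the drained list right after triggering the failure, so it no longer depends on heartbeat timing or on SCM receiving the report.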




[GitHub] [ozone] errose28 edited a comment on pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
errose28 edited a comment on pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#issuecomment-900767249


   @JacksonYao287 thanks for investigating. Are you using the `org.junit.jupiter.api.RepeatedTest` annotation to run multiple times? When I try using this annotation, it does not initialize the test properly. I have been doing my repeated runs using IntelliJ test configuration locally, or with Maven on CI. I will run another batch locally overnight and see if I can get any failures.
   
   Did you see the failure in the same place as reported originally? What timeout value did you use for your testing?
   
   EDIT: I was able to repro a failure after 138 runs (took 1.5 hours). Better than before, but not totally fixed, I agree. I'll move this to draft for now.




[GitHub] [ozone] JacksonYao287 commented on pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
JacksonYao287 commented on pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#issuecomment-900819682


   > @JacksonYao287 thanks for investigating. Are you using the `org.junit.jupiter.api.RepeatedTest` annotation to run multiple times? When I try using this annotation, it does not initialize the test properly. I have been doing my repeated runs using IntelliJ test configuration locally, or with Maven on CI. I will run another batch locally overnight and see if I can get any failures.
   
   I use the following annotations and run the test using IntelliJ:
   ```
   +import org.junit.jupiter.api.AfterEach;
   +import org.junit.jupiter.api.BeforeEach;
   +import org.junit.jupiter.api.RepeatedTest;
   ```
   
   > Did you see the failure in the same place as reported originally? What timeout value did you use for your testing?
   
   Yes, I see the same failure. I set the timeout to 1000.
   
   > EDIT: I was able to repro a failure after 138 runs (took 1.5 hours). Better than before, but not totally fixed, I agree. I'll move this to draft for now.
   
   Thanks for the work!
   
   




[GitHub] [ozone] adoroszlai commented on pull request #2549: HDDS-5604. Intermittent failure in TestPipelineClose.

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on pull request #2549:
URL: https://github.com/apache/ozone/pull/2549#issuecomment-1032714184


   /pending

