You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/10/18 06:00:24 UTC

[GitHub] [druid] maytasm opened a new pull request #10517: Fix compaction integration test CI timeout

maytasm opened a new pull request #10517:
URL: https://github.com/apache/druid/pull/10517


   Fix compaction integration test CI timeout
   
   ### Description
   
   Compaction integration test intermittently timeout (Travis job stuck until timeout of 50 minutes and/or Travis job terminates after 10 mins of not receiving new output) due to one of the test submitting/running too many tasks at the same time. Specifically, the test submits range partitioning compaction tasks with 2 subtasks each (for a total of 6 tasks). This causes the cluster to intermittently fails. I am not sure what is the _real_ maximum number of tasks the Druid cluster running in Travis can handle. It probably also depends on the type of tasks too. Anyhow, changed the task to only have 1 subtask each (for a total of 4 tasks) and the intermittent failure is now fixed.
   
   Please see below for Travis build with 72 compaction integration test.
   - Without any change from this PR. We can see that about 25% failed due to the above issue. https://travis-ci.org/github/maytasm/druid/builds/735867518
   - Removing the test that submits range partitioning compaction tasks with 2 subtasks each (for a total of 6 tasks). We can see that 100% passed. https://travis-ci.org/github/maytasm/druid/builds/736172442
   - Changing the range partitioning compaction tasks to 1 subtask each. We can see that 100% passed. https://travis-ci.org/github/maytasm/druid/builds/736451204
   
   This PR has:
   - [x] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [x] been tested in a test Druid cluster.
   
   
   - Fix compaction integration test CI timeout
   - Update Integration test README on root cause of the test failure.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on pull request #10517: Fix compaction integration test CI timeout

Posted by GitBox <gi...@apache.org>.
maytasm commented on pull request #10517:
URL: https://github.com/apache/druid/pull/10517#issuecomment-712390026


   > @maytasm thank you for working on this issue! Have you run the same Travis job more than one time to see if it was really fixed or it just happened to pass?
   
   Yes. I have changed the travis file to have the Compaction job runs 72 times (that change was in earlier commits of the PR). The results of those 72 runs can be seen in the travis links in the PR description. Tldr; without any change from this PR, I got 18 failed out of 72 runs. After the change/fix from this PR, I got 0 failed out of 72 runs.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm merged pull request #10517: Fix compaction integration test CI timeout

Posted by GitBox <gi...@apache.org>.
maytasm merged pull request #10517:
URL: https://github.com/apache/druid/pull/10517


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #10517: Fix compaction integration test CI timeout

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #10517:
URL: https://github.com/apache/druid/pull/10517#discussion_r509621671



##########
File path: integration-tests/src/test/java/org/apache/druid/tests/indexer/AbstractIndexerTest.java
##########
@@ -69,6 +73,16 @@ protected Closeable unloader(final String dataSource)
 
   protected void unloadAndKillData(final String dataSource)
   {
+    // Get all failed task logs
+    List<TaskResponseObject> allTasks = indexer.getCompleteTasksForDataSource(dataSource);
+    for (TaskResponseObject task : allTasks) {
+      if (task.getStatus().isFailure()) {
+        LOG.info("------- Found failed task. Start log:");
+        LOG.info(indexer.getTaskLog(task.getId()));

Review comment:
       Maybe the offset could be `min(88000, 80% of total log size)`, where 88000 is an approximate size of the startup logs.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #10517: Fix compaction integration test CI timeout

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #10517:
URL: https://github.com/apache/druid/pull/10517#discussion_r509645665



##########
File path: integration-tests/src/test/java/org/apache/druid/tests/indexer/AbstractIndexerTest.java
##########
@@ -69,6 +73,16 @@ protected Closeable unloader(final String dataSource)
 
   protected void unloadAndKillData(final String dataSource)
   {
+    // Get all failed task logs
+    List<TaskResponseObject> allTasks = indexer.getCompleteTasksForDataSource(dataSource);
+    for (TaskResponseObject task : allTasks) {
+      if (task.getStatus().isFailure()) {
+        LOG.info("------- Found failed task. Start log:");
+        LOG.info(indexer.getTaskLog(task.getId()));

Review comment:
       Oh yeah, forgot that it can be a negative. I think it should be a bit larger than that though. I usually don't find anything interesting from the logs which the UI shows by default.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #10517: Fix compaction integration test CI timeout

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #10517:
URL: https://github.com/apache/druid/pull/10517#discussion_r509619819



##########
File path: integration-tests/src/test/java/org/apache/druid/tests/indexer/AbstractIndexerTest.java
##########
@@ -69,6 +73,16 @@ protected Closeable unloader(final String dataSource)
 
   protected void unloadAndKillData(final String dataSource)
   {
+    // Get all failed task logs
+    List<TaskResponseObject> allTasks = indexer.getCompleteTasksForDataSource(dataSource);
+    for (TaskResponseObject task : allTasks) {
+      if (task.getStatus().isFailure()) {
+        LOG.info("------- Found failed task. Start log:");
+        LOG.info(indexer.getTaskLog(task.getId()));

Review comment:
       Any suggestion on the offset value? Another idea I have is to get the errorMsg field in the report API call




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #10517: Fix compaction integration test CI timeout

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #10517:
URL: https://github.com/apache/druid/pull/10517#discussion_r507021851



##########
File path: integration-tests/src/test/java/org/apache/druid/tests/coordinator/duty/ITAutoCompactionTest.java
##########
@@ -77,7 +77,7 @@
   @BeforeMethod
   public void setup() throws Exception
   {
-    // Set comapction slot to 10
+    // Set comapction slot to 5

Review comment:
       This comment was stale. The compaction slot was always 5. It was not changed in this PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #10517: Fix compaction integration test CI timeout

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #10517:
URL: https://github.com/apache/druid/pull/10517#discussion_r509819391



##########
File path: integration-tests/src/test/java/org/apache/druid/tests/indexer/AbstractIndexerTest.java
##########
@@ -69,6 +73,16 @@ protected Closeable unloader(final String dataSource)
 
   protected void unloadAndKillData(final String dataSource)
   {
+    // Get all failed task logs
+    List<TaskResponseObject> allTasks = indexer.getCompleteTasksForDataSource(dataSource);
+    for (TaskResponseObject task : allTasks) {
+      if (task.getStatus().isFailure()) {
+        LOG.info("------- Found failed task. Start log:");
+        LOG.info(indexer.getTaskLog(task.getId()));

Review comment:
       Set the offset to -88000 and also added a log of the errorMsg so that we are sure we will definitely have the error message




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] maytasm commented on a change in pull request #10517: Fix compaction integration test CI timeout

Posted by GitBox <gi...@apache.org>.
maytasm commented on a change in pull request #10517:
URL: https://github.com/apache/druid/pull/10517#discussion_r509631685



##########
File path: integration-tests/src/test/java/org/apache/druid/tests/indexer/AbstractIndexerTest.java
##########
@@ -69,6 +73,16 @@ protected Closeable unloader(final String dataSource)
 
   protected void unloadAndKillData(final String dataSource)
   {
+    // Get all failed task logs
+    List<TaskResponseObject> allTasks = indexer.getCompleteTasksForDataSource(dataSource);
+    for (TaskResponseObject task : allTasks) {
+      if (task.getStatus().isFailure()) {
+        LOG.info("------- Found failed task. Start log:");
+        LOG.info(indexer.getTaskLog(task.getId()));

Review comment:
       Or we can do offset=-16000 which is the same as the Console UI Log window popup thing




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #10517: Fix compaction integration test CI timeout

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #10517:
URL: https://github.com/apache/druid/pull/10517#discussion_r509618102



##########
File path: integration-tests/src/test/java/org/apache/druid/tests/indexer/AbstractIndexerTest.java
##########
@@ -69,6 +73,16 @@ protected Closeable unloader(final String dataSource)
 
   protected void unloadAndKillData(final String dataSource)
   {
+    // Get all failed task logs
+    List<TaskResponseObject> allTasks = indexer.getCompleteTasksForDataSource(dataSource);
+    for (TaskResponseObject task : allTasks) {
+      if (task.getStatus().isFailure()) {
+        LOG.info("------- Found failed task. Start log:");
+        LOG.info(indexer.getTaskLog(task.getId()));

Review comment:
       How about showing a certain amount of logs at the end instead of printing the whole? It will likely kill the Travis job by hitting the max number of logs per job. You can add a `offset` query parameter when you call the Overlord API.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #10517: Fix compaction integration test CI timeout

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #10517:
URL: https://github.com/apache/druid/pull/10517#discussion_r509621671



##########
File path: integration-tests/src/test/java/org/apache/druid/tests/indexer/AbstractIndexerTest.java
##########
@@ -69,6 +73,16 @@ protected Closeable unloader(final String dataSource)
 
   protected void unloadAndKillData(final String dataSource)
   {
+    // Get all failed task logs
+    List<TaskResponseObject> allTasks = indexer.getCompleteTasksForDataSource(dataSource);
+    for (TaskResponseObject task : allTasks) {
+      if (task.getStatus().isFailure()) {
+        LOG.info("------- Found failed task. Start log:");
+        LOG.info(indexer.getTaskLog(task.getId()));

Review comment:
       Maybe the offset could be `max(88000, 80% of total log size)`, where 88000 is an approximate size of the startup logs.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #10517: Fix compaction integration test CI timeout

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #10517:
URL: https://github.com/apache/druid/pull/10517#discussion_r507928409



##########
File path: integration-tests/README.md
##########
@@ -385,3 +386,12 @@ Please be mindful when adding tests to the "AllParallelizedTests" test tag that
 other tests from the same class at the same time. i.e. test does not modify/restart/stop the druid cluster or other dependency containers,
 test does not use excessive memory starving other concurent task, test does not modify and/or use other task, 
 supervisor, datasource it did not create. 
+
+### Limitation of Druid cluster in Travis environment
+
+By default, integration tests are run in Travis environment on commits made to open PR. These integration test jobs are
+required to pass for a PR to be elligible to be merged. Here are known issues and limitations to the Druid docker cluster
+running in Travis machine that may cause the tests to fail:
+- Number of concurrent running tasks. Although the default Druid cluster config sets the maximum number of tasks (druid.worker.capacity) to 10,
+the actual maximum can be lower depending on the type of the tasks. For example, running 2 range partitioning compaction tasks with 2 subtasks each 
+(for a total of 6 tasks) concurrently can cause the cluster to intermittently fails.

Review comment:
       typo: `fails` -> `fail`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org