You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/07/08 00:21:48 UTC

[GitHub] [druid] jihoonson opened a new pull request #11419: Add the error message in taskStatus for task failures in overlord

jihoonson opened a new pull request #11419:
URL: https://github.com/apache/druid/pull/11419


   ### Description
   
   Ingestion tasks can fail in the overlord for various reasons. Currently, these tasks are marked as failed but their taskStatus doesn't report the error message properly in most cases. These can be categorized into 3 cases as below.
   
   - Canceled by users. This case is not really task failures, but those tasks are marked as failed.
   - Internal errors. Any sort of internal errors such as unknown tasks or task assignment timeout are included in this case.
   - Sanity checks. This case is similar to internal errors, but the difference is that the cases tested by these sanity checks should never happen in any case.
   
   Because these tasks fail in the overlord, neither the task logs nor task reports are available, and thus it's quite hard for users to debug when this happens unless you have knowledge of the ingestion system internals. This PR addresses this concern by adding user-friendly error messages in taskStatus. Since the error message field is truncated, it should contain a simple message that leads users to reading logs for further investigation. The 3 cases above should change as below after this PR.
   
   - Canceled tasks should report in taskStatus that they are canceled.
   - Internal errors should report a simple user-friendly error message in taskStatus.
   - Sanity checks should also report a simple user-friendly error message in taskStatus.
   
   <hr>
   
   <!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. -->
   
   This PR has:
   - [z] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [z] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] loquisgon commented on a change in pull request #11419: Add the error message in taskStatus for task failures in overlord

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11419:
URL: https://github.com/apache/druid/pull/11419#discussion_r667171845



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/overlord/RemoteTaskRunner.java
##########
@@ -540,7 +540,9 @@ public boolean isWorkerRunningTask(ZkWorker worker, String taskId)
     } else if ((completeTask = completeTasks.get(task.getId())) != null) {
       return completeTask.getResult();
     } else {
-      return addPendingTask(task).getResult();
+      RemoteTaskRunnerWorkItem workItem = addPendingTask(task);
+      runPendingTasks();

Review comment:
       I find it confusing and error prone that after adding a task(s) now you have to explicitly call run..., maybe ad a comment to make it clear that it is required (else?)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #11419: Add the error message in taskStatus for task failures in overlord

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #11419:
URL: https://github.com/apache/druid/pull/11419#issuecomment-880983876


   @clintropolis @capistrant @loquisgon thanks for the review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #11419: Add the error message in taskStatus for task failures in overlord

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #11419:
URL: https://github.com/apache/druid/pull/11419#issuecomment-877474597


   > [ERROR] org.apache.druid.indexing.overlord.RemoteTaskRunnerTest.testBlacklistZKWorkers25Percent(org.apache.druid.indexing.overlord.RemoteTaskRunnerTest)
   [ERROR]   Run 1: RemoteTaskRunnerTest.testBlacklistZKWorkers25Percent:902 expected:<2> but was:<3>
   [ERROR]   Run 2: RemoteTaskRunnerTest.testBlacklistZKWorkers25Percent:902 expected:<1> but was:<2>
   [ERROR]   Run 3: RemoteTaskRunnerTest.testBlacklistZKWorkers25Percent:902 expected:<2> but was:<3>
   [ERROR]   Run 4: RemoteTaskRunnerTest.testBlacklistZKWorkers25Percent:902 expected:<1> but was:<2>
   
   Looking at this failure, it seems possible that the `blackListedWorkers` could be wrong if the same task failure event was counted more than once. I do not know whether this was the issue I saw on Travis, but fixed it anyway. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #11419: Add the error message in taskStatus for task failures in overlord

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #11419:
URL: https://github.com/apache/druid/pull/11419#issuecomment-880256994


   @loquisgon thanks for the review. @capistrant @clintropolis do you want to have a look at the last change I made after you approved? I will merge this PR otherwise.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #11419: Add the error message in taskStatus for task failures in overlord

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #11419:
URL: https://github.com/apache/druid/pull/11419#issuecomment-876799420


   @capistrant thanks for the review. Fixed those test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson edited a comment on pull request #11419: Add the error message in taskStatus for task failures in overlord

Posted by GitBox <gi...@apache.org>.
jihoonson edited a comment on pull request #11419:
URL: https://github.com/apache/druid/pull/11419#issuecomment-877474597


   > [ERROR] org.apache.druid.indexing.overlord.RemoteTaskRunnerTest.testBlacklistZKWorkers25Percent(org.apache.druid.indexing.overlord.RemoteTaskRunnerTest)
   [ERROR]   Run 1: RemoteTaskRunnerTest.testBlacklistZKWorkers25Percent:902 expected:<2> but was:<3>
   [ERROR]   Run 2: RemoteTaskRunnerTest.testBlacklistZKWorkers25Percent:902 expected:<1> but was:<2>
   [ERROR]   Run 3: RemoteTaskRunnerTest.testBlacklistZKWorkers25Percent:902 expected:<2> but was:<3>
   [ERROR]   Run 4: RemoteTaskRunnerTest.testBlacklistZKWorkers25Percent:902 expected:<1> but was:<2>
   
   Looking at this failure, it seems possible that the `blackListedWorkers` could be wrong if the same task failure event was counted more than once. I do not know whether this is the issue I saw on Travis, but fixed it anyway. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson merged pull request #11419: Add the error message in taskStatus for task failures in overlord

Posted by GitBox <gi...@apache.org>.
jihoonson merged pull request #11419:
URL: https://github.com/apache/druid/pull/11419


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #11419: Add the error message in taskStatus for task failures in overlord

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #11419:
URL: https://github.com/apache/druid/pull/11419#issuecomment-877717201


   I retriggered 4 jobs for indexing module test 4 times each, and all passed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] loquisgon commented on a change in pull request #11419: Add the error message in taskStatus for task failures in overlord

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11419:
URL: https://github.com/apache/druid/pull/11419#discussion_r667171845



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/overlord/RemoteTaskRunner.java
##########
@@ -540,7 +540,9 @@ public boolean isWorkerRunningTask(ZkWorker worker, String taskId)
     } else if ((completeTask = completeTasks.get(task.getId())) != null) {
       return completeTask.getResult();
     } else {
-      return addPendingTask(task).getResult();
+      RemoteTaskRunnerWorkItem workItem = addPendingTask(task);
+      runPendingTasks();

Review comment:
       I find it confusing and error prone that after adding a task(s) now you have to explicitly call run..., maybe ad a comment to make it clear that it is required (else?)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #11419: Add the error message in taskStatus for task failures in overlord

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #11419:
URL: https://github.com/apache/druid/pull/11419#issuecomment-877454483


   3 jobs for indexing module test failed on Travis. They all failed because of `RemoteTaskRunnerTest.testBlacklistZKWorkers25Percent()` and `RemoteTaskRunnerTest.testBlacklistZKWorkers50Percent()` failures. I don't remember whether these tests are flaky, but this PR doesn't seem to make them flaky either. As I'm not certain why they are flaky, I will retrigger them a couple of times just in case to see whether those failures relate to the change in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] loquisgon commented on pull request #11419: Add the error message in taskStatus for task failures in overlord

Posted by GitBox <gi...@apache.org>.
loquisgon commented on pull request #11419:
URL: https://github.com/apache/druid/pull/11419#issuecomment-879470102


   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] capistrant commented on pull request #11419: Add the error message in taskStatus for task failures in overlord

Posted by GitBox <gi...@apache.org>.
capistrant commented on pull request #11419:
URL: https://github.com/apache/druid/pull/11419#issuecomment-876795254


   CI failures look like legit unit test failures. But overall this is looking good. The added context will be a bighelp


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org