Posted to commits@druid.apache.org by "georgew5656 (via GitHub)" <gi...@apache.org> on 2023/03/29 15:02:12 UTC

[GitHub] [druid] georgew5656 opened a new pull request, #14001: Fix bug in k8s task runner in handling deleted jobs

georgew5656 opened a new pull request, #14001:
URL: https://github.com/apache/druid/pull/14001

   ### Description
   With the KubernetesTaskRunner, if a task is manually shut down while running, or its job is manually deleted, the thread responsible for overseeing the job gets stuck waiting. When the job is deleted, the fabric8 client sends the thread a single event in which the job is null, but a null job does not satisfy the wait condition.
   
   This means that the thread is stuck waiting on a fabric8 event (the job being successful) that will never arrive, and it stays stuck until maxTaskDuration (default 4 hours) elapses. If a user of the extension sets a limited taskqueue maxSize, this can cause problems because the k8s executor pool is unable to pick up additional tasks (its threads are stuck waiting on old tasks that have already been deleted).
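   
   The difference between the old and new wait conditions can be illustrated with a minimal, self-contained sketch. The `Job` and `JobStatus` classes below are hypothetical stand-ins for the fabric8 model classes, and the class and constant names are made up for illustration; only the two lambda bodies mirror the condition changed in this PR.
   
   ```java
   import java.util.function.Predicate;
   
   public class JobCompletionPredicateSketch {
       // Hypothetical stand-in for the fabric8 JobStatus model class.
       static final class JobStatus {
           private final Integer active;
           JobStatus(Integer active) { this.active = active; }
           Integer getActive() { return active; }
       }
   
       // Hypothetical stand-in for the fabric8 Job model class.
       static final class Job {
           private final JobStatus status;
           Job(JobStatus status) { this.status = status; }
           JobStatus getStatus() { return status; }
       }
   
       // Condition before the fix: a null (deleted) job never satisfies it.
       static final Predicate<Job> OLD_CONDITION =
           x -> x != null && x.getStatus() != null && x.getStatus().getActive() == null;
   
       // Condition after the fix: a null job also ends the wait.
       static final Predicate<Job> NEW_CONDITION =
           x -> (x == null) || (x.getStatus() != null && x.getStatus().getActive() == null);
   
       public static void main(String[] args) {
           Job running = new Job(new JobStatus(1));      // still has an active pod
           Job finished = new Job(new JobStatus(null));  // no active pods left
           Job deleted = null;                           // what the watcher sees after deletion
   
           System.out.println("running:  old=" + OLD_CONDITION.test(running)
               + " new=" + NEW_CONDITION.test(running));
           System.out.println("finished: old=" + OLD_CONDITION.test(finished)
               + " new=" + NEW_CONDITION.test(finished));
           System.out.println("deleted:  old=" + OLD_CONDITION.test(deleted)
               + " new=" + NEW_CONDITION.test(deleted));
       }
   }
   ```
   
   Under these assumptions, only the deleted case differs between the two conditions: the old one keeps waiting on a null job, while the new one returns.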
   
   An alternative would be to have the shutdown method in the K8s task runner cancel running futures so that they don't get stuck when the job is deleted, but this would not address the case where a k8s job is manually deleted.
   
   #### Release notes
   Fix a bug with hanging threads in the K8s Task Scheduler
   
   ##### Key changed/added classes in this PR
   Update waitForJobCompletion to exit with a failed status if the job has been deleted. This function is only called after a job has been confirmed to have launched (either right after launchJobAndWaitForStart has been called, or when a task whose job is already running is run again), so there should be no race conditions here.
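   
   As a rough illustration of "exit with a failed status", the result of the wait could be mapped to an outcome along the following lines. This is a sketch under stated assumptions, not the code in this PR: the `Outcome` enum, the `Job` stand-in class, and the `outcomeOf` helper are all hypothetical names introduced here.
   
   ```java
   public class DeletedJobResultSketch {
       // Hypothetical outcome type; the names are illustrative only.
       enum Outcome { SUCCEEDED, FAILED }
   
       // Hypothetical stand-in for the job object handed back once the wait ends.
       static final class Job {
           private final Integer succeeded;
           Job(Integer succeeded) { this.succeeded = succeeded; }
           Integer getSucceeded() { return succeeded; }
       }
   
       // Maps the object returned by the wait to an outcome.
       // A null job is assumed to mean the job was deleted, so it is reported as failed.
       static Outcome outcomeOf(Job job) {
           if (job == null) {
               return Outcome.FAILED;
           }
           Integer succeeded = job.getSucceeded();
           return (succeeded != null && succeeded > 0) ? Outcome.SUCCEEDED : Outcome.FAILED;
       }
   
       public static void main(String[] args) {
           System.out.println("deleted job:    " + outcomeOf(null));
           System.out.println("succeeded job:  " + outcomeOf(new Job(1)));
           System.out.println("unfinished job: " + outcomeOf(new Job(null)));
       }
   }
   ```
   
   Treating the null case first keeps the later status checks free of null dereferences, which matters once the wait condition is allowed to return for a deleted job.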
   
   This PR has:
   
   - [X] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [X] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [X] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] dclim closed pull request #14001: Fix bug in k8s task runner in handling deleted jobs

Posted by "dclim (via GitHub)" <gi...@apache.org>.
dclim closed pull request #14001: Fix bug in k8s task runner in handling deleted jobs
URL: https://github.com/apache/druid/pull/14001




[GitHub] [druid] abhishekagarwal87 commented on a diff in pull request #14001: Fix bug in k8s task runner in handling deleted jobs

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #14001:
URL: https://github.com/apache/druid/pull/14001#discussion_r1152247505


##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java:
##########
@@ -106,10 +106,13 @@ public JobResponse waitForJobCompletion(K8sTaskId taskId, long howLong, TimeUnit
                       .inNamespace(namespace)
                       .withName(taskId.getK8sTaskId())
                       .waitUntilCondition(
-                          x -> x != null && x.getStatus() != null && x.getStatus().getActive() == null,
+                          x -> x == null || x.getStatus() != null && x.getStatus().getActive() == null,

Review Comment:
   ```suggestion
                             x -> (x == null) || (x.getStatus() != null && x.getStatus().getActive() == null),
   ```





[GitHub] [druid] abhishekagarwal87 merged pull request #14001: Fix bug in k8s task runner in handling deleted jobs

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 merged PR #14001:
URL: https://github.com/apache/druid/pull/14001

