You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by "georgew5656 (via GitHub)" <gi...@apache.org> on 2023/03/31 21:10:44 UTC

[GitHub] [druid] georgew5656 opened a new pull request, #14010: Fix issues with null pointers on jobResponse

georgew5656 opened a new pull request, #14010:
URL: https://github.com/apache/druid/pull/14010

   ### Description
   I was doing some additional testing on my change in https://github.com/apache/druid/pull/14001, and realized that I missed some logic to handle the null job in  getJobDuration. This causes a null pointer exception to be thrown when a job is manually shut down.
   
   #### Release note
   - FIx bug with manual task shutdown in the KubernetesTaskRunner
   - 
   ##### Key changed/added classes in this PR
   - Explicitly pass null as the job parameter to the JobResponse constructor
   - Add a log for when the job has been manually cleaned up
   - Check job null status when grabbing name for logs
   
   - [X] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [X] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [X] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a diff in pull request #14010: Fix issues with null pointers on jobResponse

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #14010:
URL: https://github.com/apache/druid/pull/14010#discussion_r1155975273


##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java:
##########
@@ -111,7 +111,8 @@ public JobResponse waitForJobCompletion(K8sTaskId taskId, long howLong, TimeUnit
                           unit
                       );
       if (job == null) {
-        return new JobResponse(job, PeonPhase.FAILED);
+        log.info("K8s job for task was not found %s", taskId);
+        return new JobResponse(null, PeonPhase.FAILED);

Review Comment:
   you should also update the JobResponse ctr to mark the first argument @Nullable



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on pull request #14010: Fix issues with null pointers on jobResponse

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on PR #14010:
URL: https://github.com/apache/druid/pull/14010#issuecomment-1493692484

   When you say, job is manually cleaned up, do you mean someone cancelling the task from console (or from druid API) or someone terminating the pod manually? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis merged pull request #14010: Fix issues with null pointers on jobResponse

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis merged PR #14010:
URL: https://github.com/apache/druid/pull/14010


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] georgew5656 commented on a diff in pull request #14010: Fix issues with null pointers on jobResponse

Posted by "georgew5656 (via GitHub)" <gi...@apache.org>.
georgew5656 commented on code in PR #14010:
URL: https://github.com/apache/druid/pull/14010#discussion_r1155414228


##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java:
##########
@@ -111,7 +111,8 @@ public JobResponse waitForJobCompletion(K8sTaskId taskId, long howLong, TimeUnit
                           unit
                       );
       if (job == null) {
-        return new JobResponse(job, PeonPhase.FAILED);
+        log.info("K8s job for task was not found %s", taskId);
+        return new JobResponse(null, PeonPhase.FAILED);

Review Comment:
   it still gets reported by https://github.com/apache/druid/blob/master/extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java#L184 since its a PeonPhase.FAILED status



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a diff in pull request #14010: Fix issues with null pointers on jobResponse

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #14010:
URL: https://github.com/apache/druid/pull/14010#discussion_r1156708286


##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java:
##########
@@ -111,7 +111,8 @@ public JobResponse waitForJobCompletion(K8sTaskId taskId, long howLong, TimeUnit
                           unit
                       );
       if (job == null) {
-        return new JobResponse(job, PeonPhase.FAILED);
+        log.info("K8s job for task was not found %s", taskId);

Review Comment:
   This is better. Though "task was deleted" is more of an implementation detail. "Task was canceled or pod got terminated" seems better. So if pods were reaped then users will know immediately. Or if they canceled the job, they will be able to put two and two together themselves. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] nlippis commented on a diff in pull request #14010: Fix issues with null pointers on jobResponse

Posted by "nlippis (via GitHub)" <gi...@apache.org>.
nlippis commented on code in PR #14010:
URL: https://github.com/apache/druid/pull/14010#discussion_r1154906934


##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java:
##########
@@ -111,7 +111,8 @@ public JobResponse waitForJobCompletion(K8sTaskId taskId, long howLong, TimeUnit
                           unit
                       );
       if (job == null) {
-        return new JobResponse(job, PeonPhase.FAILED);
+        log.info("K8s job for task was not found %s", taskId);
+        return new JobResponse(null, PeonPhase.FAILED);

Review Comment:
   Though this is an improvement over the current NPE, if we do this then a status won't ever be recorded for the task and it may be attempted to be rerun.  We should report the status using the taskid passed into this function.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a diff in pull request #14010: Fix issues with null pointers on jobResponse

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #14010:
URL: https://github.com/apache/druid/pull/14010#discussion_r1155973634


##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java:
##########
@@ -111,7 +111,8 @@ public JobResponse waitForJobCompletion(K8sTaskId taskId, long howLong, TimeUnit
                           unit
                       );
       if (job == null) {
-        return new JobResponse(job, PeonPhase.FAILED);
+        log.info("K8s job for task was not found %s", taskId);

Review Comment:
   ```suggestion
           log.info("K8s job for the task [%s] was not found. It can happen if the task was canceled", taskId);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] georgew5656 commented on a diff in pull request #14010: Fix issues with null pointers on jobResponse

Posted by "georgew5656 (via GitHub)" <gi...@apache.org>.
georgew5656 commented on code in PR #14010:
URL: https://github.com/apache/druid/pull/14010#discussion_r1157274017


##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java:
##########
@@ -111,7 +111,8 @@ public JobResponse waitForJobCompletion(K8sTaskId taskId, long howLong, TimeUnit
                           unit
                       );
       if (job == null) {
-        return new JobResponse(job, PeonPhase.FAILED);
+        log.info("K8s job for task was not found %s", taskId);

Review Comment:
   i put deleted because the task shows as cancelled when you shut it down through the console but failed if you e.g. delete the k8s job



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a diff in pull request #14010: Fix issues with null pointers on jobResponse

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #14010:
URL: https://github.com/apache/druid/pull/14010#discussion_r1155502001


##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java:
##########
@@ -111,7 +111,8 @@ public JobResponse waitForJobCompletion(K8sTaskId taskId, long howLong, TimeUnit
                           unit
                       );
       if (job == null) {
-        return new JobResponse(job, PeonPhase.FAILED);
+        log.info("K8s job for task was not found %s", taskId);

Review Comment:
   let's also log the likely reason for this in the log line itself. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] georgew5656 commented on a diff in pull request #14010: Fix issues with null pointers on jobResponse

Posted by "georgew5656 (via GitHub)" <gi...@apache.org>.
georgew5656 commented on code in PR #14010:
URL: https://github.com/apache/druid/pull/14010#discussion_r1156527106


##########
extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java:
##########
@@ -111,7 +111,8 @@ public JobResponse waitForJobCompletion(K8sTaskId taskId, long howLong, TimeUnit
                           unit
                       );
       if (job == null) {
-        return new JobResponse(job, PeonPhase.FAILED);
+        log.info("K8s job for task was not found %s", taskId);

Review Comment:
   @abhishekagarwal87  I wrongly took this to mean you wanted the reason to actually be surfaced in the error message so I added it there in addition to updating this log line.
   
   the error looks like this now
   <img width="1209" alt="Screenshot 2023-04-03 at 6 58 23 PM" src="https://user-images.githubusercontent.com/17736581/229644725-cf663bb3-3f67-412d-a62c-e068d9a004df.png">
   
   lmk if you think thats too much info to surface in the console and I can remove that commit, otherwise I addressed the rest of the comments
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] georgew5656 commented on pull request #14010: Fix issues with null pointers on jobResponse

Posted by "georgew5656 (via GitHub)" <gi...@apache.org>.
georgew5656 commented on PR #14010:
URL: https://github.com/apache/druid/pull/14010#issuecomment-1494325255

   > When you say, job is manually cleaned up, do you mean someone cancelling the task from console (or from druid API) or someone terminating the pod manually?
   
   either cancelling the task or terminating the k8s job manually will do this (terminating the pod will cause the job to restart it iirc)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org