Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/02/04 16:38:02 UTC

[GitHub] [druid] klarose opened a new pull request #12233: kubernetes: restart watch on null response

klarose opened a new pull request #12233:
URL: https://github.com/apache/druid/pull/12233


   
   
   Fixes #11520.
   
   
   
   ### Description
   
   Kubernetes watches allow a client to efficiently process changes to
   resources. However, they have some idiosyncrasies. In particular, they
   can error out for various reasons, leading to what would normally be seen
   as an invalid result.
   
   The Druid kubernetes node discovery subsystem does not handle a certain
   case properly. The watch can return an item with a null object. This
   leads to a null pointer exception. When this happens, the provider needs
   to restart the watch, because rerunning the watch from the same resource
   version leads to the same result: yet another null pointer exception.
   
   This commit changes the provider to handle null objects by restarting
   the watch.
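
A minimal sketch of the restart behavior described above, assuming simplified stand-ins rather than the real Kubernetes client or Druid classes. `WatchRestartSketch`, `Pod`, `Event`, `keepWatching`, and the `openWatch` function are all hypothetical names for illustration. The point it shows: when an item arrives with a null object, the remembered resource version is dropped so the next watch starts from the beginning instead of replaying the same bad position.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

final class WatchRestartSketch {
    /** Hypothetical stand-in for a pod; only the resource version matters here. */
    static final class Pod {
        final String resourceVersion;
        Pod(String resourceVersion) { this.resourceVersion = resourceVersion; }
    }

    /** Hypothetical stand-in for a watch event; object is null when the watch has gone bad. */
    static final class Event {
        final String type;
        final Pod object;
        Event(String type, Pod object) { this.type = type; this.object = object; }
    }

    /**
     * Opens up to maxWatches watches in sequence and returns a trace of what happened.
     * openWatch receives the resource version to resume from (null means "start over").
     */
    static List<String> keepWatching(Function<String, Iterator<Event>> openWatch, int maxWatches) {
        List<String> trace = new ArrayList<>();
        String resourceVersion = null; // null: begin from a fresh listing
        for (int i = 0; i < maxWatches; i++) {
            trace.add("open@" + resourceVersion);
            Iterator<Event> watch = openWatch.apply(resourceVersion);
            while (watch.hasNext()) {
                Event item = watch.next();
                if (item != null && item.object != null) {
                    // Normal case: remember where to resume from.
                    resourceVersion = item.object.resourceVersion;
                    trace.add("seen@" + resourceVersion);
                } else {
                    // Null object: resuming from the same version would fail again,
                    // so forget the version and restart the watch from the beginning.
                    resourceVersion = null;
                    trace.add("restart");
                    break;
                }
            }
        }
        return trace;
    }

    public static void main(String[] args) {
        // Fake API: a fresh watch yields a pod at version "5"; resuming yields a null object.
        Function<String, Iterator<Event>> fakeApi = rv ->
            List.of(rv == null ? new Event("ADDED", new Pod("5")) : new Event("ERROR", null)).iterator();
        System.out.println(keepWatching(fakeApi, 3));
    }
}
```

Without the reset in the `else` branch, every subsequent watch in this sketch would reopen at the same version and hit the null object again, which is the tight loop the fix is meant to avoid.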
   
   A cleaner alternative to this would be to change the provider to use an [Informer](https://github.com/kubernetes-client/java/blob/master/examples/examples-release-14/src/main/java/io/kubernetes/client/examples/InformerExample.java). I suspect this would simplify the code substantially while handling most, if not all, of the corner cases we could run into by using a bare watch. I don't quite have the time to undertake a large change like that, though, so I'm submitting this quick fix so that we can at least resolve the most common issue that seems to affect the kubernetes provider.
   
   
   <hr>
   
   ##### Key changed/added classes in this PR
   * DefaultK8sApiClient: now propagates the null object. Logs a warning when this happens.
   * K8sDruidNodeDiscoveryProvider: handles the null from the watch by restarting it. Logs a warning.
   <hr>
   
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [x] been tested in a test Druid cluster.
   
   
   Note on testing: I didn't add unit tests to DefaultK8sApiClient. The infrastructure to do so was not present, unfortunately, and I suspect it'd be a large undertaking. In terms of testing in a cluster, I reproduced the issue using [microk8s](https://microk8s.io/). I then repeated the reproduction with my fix applied and confirmed that the error message no longer occurred in a tight loop and that discovery still worked (I restarted a pod, and the new one was discovered).
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #12233: kubernetes: restart watch on null response

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #12233:
URL: https://github.com/apache/druid/pull/12233#issuecomment-1064497208


   > Thanks! I was a bit confused by that, since the affected tests seemed unrelated to my changes.
   
   Yeah, sorry about that. Weird things happen sometimes.




[GitHub] [druid] gianm commented on a change in pull request #12233: kubernetes: restart watch on null response

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #12233:
URL: https://github.com/apache/druid/pull/12233#discussion_r818274025



##########
File path: extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/K8sDruidNodeDiscoveryProvider.java
##########
@@ -282,7 +282,10 @@ private void keepWatching(String namespace, String labelSelector, String resourc
                 nextResourceVersion = item.object.getResourceVersion();
 
               } else {
-                LOGGER.error("WTH! item or item.type is NULL");
+                // Try again by starting the watch from the beginning. This can happen if the
+                // watch goes bad.
+                LOGGER.warn("Received NULL item. Restarting watch");

Review comment:
       Couple things about the log message:
   
   - From your description in the PR, it sounds like this is something that can happen "normally" and so INFO is better than WARN. The WARN level should be for something that a user might need to look into.
   - Would it be useful to include `nodeRole` here? (Up to you.)

##########
File path: extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/DefaultK8sApiClient.java
##########
@@ -132,12 +132,22 @@ public boolean hasNext() throws SocketTimeoutException
             while (watch.hasNext()) {
               Watch.Response<V1Pod> item = watch.next();
               if (item != null && item.type != null) {
+                DiscoveryDruidNodeAndResourceVersion result = null;
+                if (item.object != null) {
+                  result = new DiscoveryDruidNodeAndResourceVersion(
+                    item.object.getMetadata().getResourceVersion(),
+                    getDiscoveryDruidNodeFromPodDef(nodeRole, item.object)
+                  );
+                } else {
+                  // The item's object can be null in some cases -- likely due to a blip
+                  // in the k8s watch. Handle that by passing the null upwards. The caller
+                  // needs to know that the object can be null.
+                  LOGGER.warn("item of type " + item.type + " was NULL");

Review comment:
       Similar to the other comment about logging — it sounds like this is something that can happen "normally", and we don't expect users to look into it when this message appears. If that's true, INFO or even DEBUG is better than WARN.

##########
File path: extensions-core/kubernetes-extensions/src/test/java/org/apache/druid/k8s/discovery/K8sDruidNodeDiscoveryProviderTest.java
##########
@@ -162,6 +163,140 @@ public void testGetForNodeRole() throws Exception
     discoveryProvider.stop();
   }
 
+  @Test(timeout = 10_000)
+  public void testNodeRoleWatcherHandlesNullFromAPIByRestarting() throws Exception

Review comment:
       Thanks for adding a test!






[GitHub] [druid] klarose commented on pull request #12233: kubernetes: restart watch on null response

Posted by GitBox <gi...@apache.org>.
klarose commented on pull request #12233:
URL: https://github.com/apache/druid/pull/12233#issuecomment-1062157975


   @gianm Thanks for the review! Your comments made sense, and I have made changes accordingly.




[GitHub] [druid] klarose commented on pull request #12233: kubernetes: restart watch on null response

Posted by GitBox <gi...@apache.org>.
klarose commented on pull request #12233:
URL: https://github.com/apache/druid/pull/12233#issuecomment-1033927493


   @himanshug Would you be able to review this? I'd like to get it into an official release so I can avoid building my own images. I'm sure the others that commented on the issue would as well.
   
   Thanks!




[GitHub] [druid] gianm merged pull request #12233: kubernetes: restart watch on null response

Posted by GitBox <gi...@apache.org>.
gianm merged pull request #12233:
URL: https://github.com/apache/druid/pull/12233


   




[GitHub] [druid] gianm commented on pull request #12233: kubernetes: restart watch on null response

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #12233:
URL: https://github.com/apache/druid/pull/12233#issuecomment-1057619334


   @klarose Please consider my comments about log messages, and changing the level if this really is a "normal" situation. I'd be OK with merging this otherwise. Thanks for the contribution!




[GitHub] [druid] klarose commented on pull request #12233: kubernetes: restart watch on null response

Posted by GitBox <gi...@apache.org>.
klarose commented on pull request #12233:
URL: https://github.com/apache/druid/pull/12233#issuecomment-1064453068


   > LGTM - thanks!
   > 
   > I've restarted the jdk15 tests. They were broken for a while due to an issue on the Travis side that I think is resolved now.
   
   Thanks! I was a bit confused by that, since the affected tests seemed unrelated to my changes.

