You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by "clintropolis (via GitHub)" <gi...@apache.org> on 2023/03/14 08:35:04 UTC

[GitHub] [druid] clintropolis opened a new pull request, #13930: add null safety checks for DiscoveryDruidNode services for more resilient http server and task views

clintropolis opened a new pull request, #13930:
URL: https://github.com/apache/druid/pull/13930

   ### Description
   This PR adds some additional null checks to `HttpServerInventoryView`, `HttpRemoteTaskRunner`, `LookupNodeDiscovery`, and `SystemSchema` to ensure that the `DruidService` they are trying to extract from `DiscoveryDruidNode` are not null.
   
   This change was motivated by an encounter with a null pointer exception in `HttpServerInventoryView` when using 'indexer' nodes to handle ingestion, which prior to #12640 used to announce themselves as both 'indexer' and 'peon' node roles, with the latter having an empty services set. When the indexer goes offline, the empty services section caused a null pointer exception since it was expected to never be null.
   
   
   <hr>
   
   This PR has:
   
   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] paul-rogers commented on a diff in pull request #13930: add null safety checks for DiscoveryDruidNode services for more resilient http server and task views

Posted by "paul-rogers (via GitHub)" <gi...@apache.org>.
paul-rogers commented on code in PR #13930:
URL: https://github.com/apache/druid/pull/13930#discussion_r1139445564


##########
server/src/main/java/org/apache/druid/server/coordination/ServerType.java:
##########
@@ -149,6 +149,18 @@ public static ServerType fromString(String type)
     return ServerType.valueOf(StringUtils.toUpperCase(type).replace('-', '_'));
   }
 
+  public static ServerType fromNodeRole(NodeRole nodeRole)
+  {
+    // this doesn't actually check that the NodeRole is a typical data node
+    if (nodeRole.equals(NodeRole.HISTORICAL)) {
+      return ServerType.HISTORICAL;
+    } else if (nodeRole.equals(NodeRole.BROKER)) {
+      return ServerType.BROKER;
+    } else {
+      return ServerType.INDEXER_EXECUTOR;

Review Comment:
   Should this handle the actual indexer executor roles? If I were to call this with `NodeRole.COORDINATOR`, I'd learn that id is an indexer executor?



##########
indexing-service/src/main/java/org/apache/druid/indexing/overlord/hrtr/HttpRemoteTaskRunner.java:
##########
@@ -565,13 +566,26 @@ public void nodeViewInitialized()
 
   private Worker toWorker(DiscoveryDruidNode node)
   {
+    final WorkerNodeService workerNodeService = node.getService(WorkerNodeService.DISCOVERY_SERVICE_KEY, WorkerNodeService.class);
+    if (workerNodeService == null) {
+      // this shouldn't typically happen, but just in case it does, make a dummy worker to allow the callbacks to
+      // continue since addWorker/removeWorker only need worker.getHost()

Review Comment:
   This kind of thing is a great short-term fix, but seems like a ticking time bomb. Some later bit of code will call this expecting to get a real `Worker`. Should the code instead either a) change to just do `getNodeHost()`, or if it really is a `toWorker()`, return `null` if there isn't actually a worker?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] imply-cheddar commented on a diff in pull request #13930: add null safety checks for DiscoveryDruidNode services for more resilient http server and task views

Posted by "imply-cheddar (via GitHub)" <gi...@apache.org>.
imply-cheddar commented on code in PR #13930:
URL: https://github.com/apache/druid/pull/13930#discussion_r1135222992


##########
server/src/main/java/org/apache/druid/discovery/DiscoveryDruidNode.java:
##########
@@ -164,6 +166,13 @@ public DruidNode getDruidNode()
     return druidNode;
   }
 
+  @Nullable
+  @JsonIgnore
+  public <T extends DruidService> T getService(String key)
+  {
+    return (T) services.get(key);
+  }
+

Review Comment:
   What are the chances that there's a ClassCastException thrown here?  Should we try to be defensive?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis merged pull request #13930: add null safety checks for DiscoveryDruidNode services for more resilient http server and task views

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis merged PR #13930:
URL: https://github.com/apache/druid/pull/13930


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a diff in pull request #13930: add null safety checks for DiscoveryDruidNode services for more resilient http server and task views

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #13930:
URL: https://github.com/apache/druid/pull/13930#discussion_r1140685776


##########
server/src/main/java/org/apache/druid/server/coordination/ServerType.java:
##########
@@ -149,6 +149,18 @@ public static ServerType fromString(String type)
     return ServerType.valueOf(StringUtils.toUpperCase(type).replace('-', '_'));
   }
 
+  public static ServerType fromNodeRole(NodeRole nodeRole)
+  {
+    // this doesn't actually check that the NodeRole is a typical data node
+    if (nodeRole.equals(NodeRole.HISTORICAL)) {
+      return ServerType.HISTORICAL;
+    } else if (nodeRole.equals(NodeRole.BROKER)) {
+      return ServerType.BROKER;
+    } else {
+      return ServerType.INDEXER_EXECUTOR;

Review Comment:
   Yeah, I thought about that. Nothing _should_ be using `ServerType` unless they are dealing with nodes that can hold segments so its quite unexpected that other node roles would end up here. I was hesitant to add another precondition here because throwing any sort of exception in the callback that is using this method is pretty uncool, so if we add stricter validation here then the callers should probably either do their own `NodeRole` check or catch the exceptions (just in case improper node roles somehow get in the watch list).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a diff in pull request #13930: add null safety checks for DiscoveryDruidNode services for more resilient http server and task views

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #13930:
URL: https://github.com/apache/druid/pull/13930#discussion_r1135285104


##########
server/src/main/java/org/apache/druid/discovery/DiscoveryDruidNode.java:
##########
@@ -164,6 +166,13 @@ public DruidNode getDruidNode()
     return druidNode;
   }
 
+  @Nullable
+  @JsonIgnore
+  public <T extends DruidService> T getService(String key)
+  {
+    return (T) services.get(key);
+  }
+

Review Comment:
   went ahead and changed to do safer cast which changed the signature to need a `Class<T>` which seems ok :shrug:



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a diff in pull request #13930: add null safety checks for DiscoveryDruidNode services for more resilient http server and task views

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #13930:
URL: https://github.com/apache/druid/pull/13930#discussion_r1135256873


##########
server/src/main/java/org/apache/druid/discovery/DiscoveryDruidNode.java:
##########
@@ -164,6 +166,13 @@ public DruidNode getDruidNode()
     return druidNode;
   }
 
+  @Nullable
+  @JsonIgnore
+  public <T extends DruidService> T getService(String key)
+  {
+    return (T) services.get(key);
+  }
+

Review Comment:
   like if rogue nodes announce things at the wrong key?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a diff in pull request #13930: add null safety checks for DiscoveryDruidNode services for more resilient http server and task views

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #13930:
URL: https://github.com/apache/druid/pull/13930#discussion_r1140697270


##########
indexing-service/src/main/java/org/apache/druid/indexing/overlord/hrtr/HttpRemoteTaskRunner.java:
##########
@@ -565,13 +566,26 @@ public void nodeViewInitialized()
 
   private Worker toWorker(DiscoveryDruidNode node)
   {
+    final WorkerNodeService workerNodeService = node.getService(WorkerNodeService.DISCOVERY_SERVICE_KEY, WorkerNodeService.class);
+    if (workerNodeService == null) {
+      // this shouldn't typically happen, but just in case it does, make a dummy worker to allow the callbacks to
+      // continue since addWorker/removeWorker only need worker.getHost()

Review Comment:
   Yeah, I would agree it seems a bit fragile, I was trying to avoid a larger refactor since the code flow here is a bit complicated. The thing behind both `addWorker` and `removeWorker` (and the analog methods on `HttpServerInventoryView`, `serverAdded` and `serverRemoved`) is combining tasks (or segments) discovered and the worker/server that is serving them to track what is where, so theoretically any workers/servers that are .. "service-less" should just hang out harmlessly in the collection of known hosts since no segments or tasks should be associated with them, and the host part is the only thing used when adding them or removing them from the collection.
   
   It would probably be worth refactoring to modify these views to just silently ignore these servers instead of adding them to the collection, but seemed a bit higher risk than I was hoping to take on at the moment, so I was trying to add these guards here "just in case" because the consequences of a failure in these callbacks is pretty bad. Maybe its not as hard as I'm making it out to be though, I'll have a look to see how complicated it would be to just ignore  nodes that are not `WorkerNodeService` or `DataNodeService`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org