You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/03/06 22:01:27 UTC

[GitHub] [druid] a2l007 opened a new pull request #10030: Support extensibilty for table datasources

a2l007 opened a new pull request #10030:
URL: https://github.com/apache/druid/pull/10030


   <!-- Thanks for trying to help us make Apache Druid be the best it can be! Please fill out as much of the following information as is possible (where relevant, and remove it when irrelevant) to help make the intention and scope of this PR clear in order to ease review. -->
   
   Fixes #9791 .
   
   <!-- Replace XXXX with the id of the issue fixed in this PR. Remove this section if there is no corresponding issue. Don't reference the issue in the title of this pull-request. -->
   
   <!-- If you are a committer, follow the PR action item checklist for committers:
   https://github.com/apache/druid/blob/master/dev/committer-instructions.md#pr-and-issue-action-item-checklist-for-committers. -->
   
   ### Description
   Introduces a new interface `MultiTableDataSource` that extends `DataSource` and provides extensibilty for multi table datasources. Presently, [UnionDataSource](https://druid.apache.org/docs/latest/querying/datasource.html#union) is the only datasource within druid under this category. For more details on the usecase, please refer to the proposal.
   `UnionDataSource` has been refactored to be an implementation of `MultiTableDataSource`. As a result, union datasource queries no longer execute sequentially per datasource and this improves the query performance.
   
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [x] been tested in a test Druid cluster.
   
   <!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist above are strictly necessary, but it would be very helpful if you at least self-review the PR. -->
   
   <hr>
   ### Key classes changed in this PR
   
   1. `CachingClusteredClient`
   2. `ServerManager`
   3. `UnionDataSource`
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #10030: Support extensibilty for table datasources

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #10030:
URL: https://github.com/apache/druid/pull/10030#issuecomment-792074314


   @a2l007 would you please resolve conflicts? Also, I think this change requires a feature flag as the new behavior could affect other queries. Suppose you have a query of 100 datasources. After this change, Druid will execute the 100 subqueries simultaneously which can hog all processing threads in historicals. It can also change the priority computed by `ThresholdBasedQueryPrioritizationStrategy` as it will count all segments together to compute the query priority.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] a2l007 commented on a change in pull request #10030: Support extensibilty for table datasources

Posted by GitBox <gi...@apache.org>.
a2l007 commented on a change in pull request #10030:
URL: https://github.com/apache/druid/pull/10030#discussion_r477402603



##########
File path: processing/src/main/java/org/apache/druid/query/MultiTableDataSource.java
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query;
+
+import org.apache.druid.timeline.Overshadowable;
+import org.apache.druid.timeline.TimelineLookup;
+import org.apache.druid.timeline.TimelineObjectHolder;
+import org.joda.time.Interval;
+
+import java.util.List;
+import java.util.Map;
+import java.util.function.BiFunction;
+
+/**
+ * Represents a source of data for a query obtained from multiple base tables. Implementations of this interface
+ * must handle more than one table dataSource.

Review comment:
       makes sense. Thanks

##########
File path: processing/src/main/java/org/apache/druid/query/UnionDataSource.java
##########
@@ -26,12 +26,19 @@
 import com.google.common.collect.ImmutableList;
 import com.google.common.collect.Iterables;
 import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.timeline.Overshadowable;
+import org.apache.druid.timeline.TimelineLookup;
+import org.apache.druid.timeline.TimelineObjectHolder;
+import org.joda.time.Interval;
 
+import java.util.ArrayList;
 import java.util.List;
+import java.util.Map;
 import java.util.Set;
+import java.util.function.BiFunction;
 import java.util.stream.Collectors;
 
-public class UnionDataSource implements DataSource
+public class UnionDataSource implements MultiTableDataSource

Review comment:
       In that case, I would expect us to have a `MultiTableDataSource` implementation say `UnionTableDataSource` and the current `UnionDataSource` would go back to being a `DataSource` implementation that would start handling a list of `DataSource` objects. 
   Also, this would mean that we'd still have to keep the `UnionQueryRunner` to execute the nontable datasources and `UnionTableDataSource` separately and merge the results. Therefore for a union query with multiple tables and nontables, the tables would still be executed together under `UnionTableDataSource` and the non table datasources would be executed individually. There should be some refactoring within `ClientQuerySegmentWalker` in case we need to handle non tables that can be fulfilled with `LocalQuerySegmentWalker`.
   Do you see any issues with the `MultiTableDataSource` design in this PR?

##########
File path: server/src/main/java/org/apache/druid/server/coordination/ServerManager.java
##########
@@ -204,22 +224,45 @@ public ServerManager(
         analysis.getBaseQuery().orElse(query)
     );
 
-    final FunctionalIterable<QueryRunner<T>> queryRunners = FunctionalIterable
-        .create(specs)
-        .transformCat(
-            descriptor -> Collections.singletonList(
-                buildQueryRunnerForSegment(
+    final FunctionalIterable<QueryRunner<T>> queryRunners;
+    // This is to handle cases where multiple datasources may have the same SegmentDescriptor for an interval.

Review comment:
       yeah I was trying to avoid including datasources with segment descriptors as it would increase the size of the json payload transferred from the broker.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] stale[bot] closed pull request #10030: Support extensibilty for table datasources

Posted by GitBox <gi...@apache.org>.
stale[bot] closed pull request #10030:
URL: https://github.com/apache/druid/pull/10030


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] a2l007 commented on pull request #10030: Support extensibilty for table datasources

Posted by GitBox <gi...@apache.org>.
a2l007 commented on pull request #10030:
URL: https://github.com/apache/druid/pull/10030#issuecomment-668262883


   Added tests and fixed conflicts. PR should be good for review 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #10030: Support extensibilty for table datasources

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #10030:
URL: https://github.com/apache/druid/pull/10030#issuecomment-792071015


   @a2l007 sorry for the late review. I reopened this PR as I believe it's still worth. Will try to review soon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #10030: Support extensibilty for table datasources

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #10030:
URL: https://github.com/apache/druid/pull/10030#discussion_r476204166



##########
File path: server/src/main/java/org/apache/druid/server/coordination/ServerManager.java
##########
@@ -204,22 +224,45 @@ public ServerManager(
         analysis.getBaseQuery().orElse(query)
     );
 
-    final FunctionalIterable<QueryRunner<T>> queryRunners = FunctionalIterable
-        .create(specs)
-        .transformCat(
-            descriptor -> Collections.singletonList(
-                buildQueryRunnerForSegment(
+    final FunctionalIterable<QueryRunner<T>> queryRunners;
+    // This is to handle cases where multiple datasources may have the same SegmentDescriptor for an interval.

Review comment:
       I don't think this is going to work in all cases. Consider the case where two datasources that are being unioned both have a segment with version V1, interval 2000/P1D, and partition 0. But let's say the second datasource _also_ has one with version V2. In this case V1 should be overshadowed, but with this logic, it won't be. Both V1 and V2 will be processed for the second datasource.
   
   I think to properly push this down to historicals, we'll need to change MultipleSpecificSegmentSpec to associate datasources with segment descriptors.

##########
File path: processing/src/main/java/org/apache/druid/query/MultiTableDataSource.java
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query;
+
+import org.apache.druid.timeline.Overshadowable;
+import org.apache.druid.timeline.TimelineLookup;
+import org.apache.druid.timeline.TimelineObjectHolder;
+import org.joda.time.Interval;
+
+import java.util.List;
+import java.util.Map;
+import java.util.function.BiFunction;
+
+/**
+ * Represents a source of data for a query obtained from multiple base tables. Implementations of this interface
+ * must handle more than one table dataSource.

Review comment:
       Hmm, why not make TableDataSource an implementation of this interface, and give it a pretty basic `retrieveSegmentsForIntervals` method? Then, we know that anything that access tables directly is an implementation of this interface. It might make some of the logic in DataSourceAnalysis simpler.

##########
File path: processing/src/main/java/org/apache/druid/query/UnionDataSource.java
##########
@@ -26,12 +26,19 @@
 import com.google.common.collect.ImmutableList;
 import com.google.common.collect.Iterables;
 import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.timeline.Overshadowable;
+import org.apache.druid.timeline.TimelineLookup;
+import org.apache.druid.timeline.TimelineObjectHolder;
+import org.joda.time.Interval;
 
+import java.util.ArrayList;
 import java.util.List;
+import java.util.Map;
 import java.util.Set;
+import java.util.function.BiFunction;
 import java.util.stream.Collectors;
 
-public class UnionDataSource implements DataSource
+public class UnionDataSource implements MultiTableDataSource

Review comment:
       In the future, if we wanted to extend UnionDataSource to support unioning nontables (like, perhaps, unioning the results of queries) then do you see a good migration path to get there?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] a2l007 edited a comment on pull request #10030: [WIP] Support extensibilty for table datasources

Posted by GitBox <gi...@apache.org>.
a2l007 edited a comment on pull request #10030:
URL: https://github.com/apache/druid/pull/10030#issuecomment-656218949


   @gianm @jihoonson PR is WIP due to missing tests and conflicts but it would be helpful to get your thoughts on the design here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] stale[bot] commented on pull request #10030: Support extensibilty for table datasources

Posted by GitBox <gi...@apache.org>.
stale[bot] commented on pull request #10030:
URL: https://github.com/apache/druid/pull/10030#issuecomment-751245254


   This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] stale[bot] commented on pull request #10030: Support extensibilty for table datasources

Posted by GitBox <gi...@apache.org>.
stale[bot] commented on pull request #10030:
URL: https://github.com/apache/druid/pull/10030#issuecomment-792070937


   This pull request/issue is no longer marked as stale.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] a2l007 commented on pull request #10030: [WIP] Support extensibilty for table datasources

Posted by GitBox <gi...@apache.org>.
a2l007 commented on pull request #10030:
URL: https://github.com/apache/druid/pull/10030#issuecomment-656218949


   @gianm @jihoonson PR is WIP due to missing tests but it would be great to get your thoughts on the design here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] a2l007 commented on pull request #10030: Support extensibilty for table datasources

Posted by GitBox <gi...@apache.org>.
a2l007 commented on pull request #10030:
URL: https://github.com/apache/druid/pull/10030#issuecomment-793053766


   @jihoonson Thanks for following up. I'll try to restore this PR soon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #10030: [WIP] Support extensibilty for table datasources

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #10030:
URL: https://github.com/apache/druid/pull/10030#issuecomment-656473829


   @a2l007 sorry, I missed this PR. I left a question on https://github.com/apache/druid/issues/9791.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] stale[bot] commented on pull request #10030: Support extensibilty for table datasources

Posted by GitBox <gi...@apache.org>.
stale[bot] commented on pull request #10030:
URL: https://github.com/apache/druid/pull/10030#issuecomment-723501446


   This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org