You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/08/24 18:28:52 UTC

[GitHub] [druid] gianm opened a new pull request #10318: Fix handling of 'join' on top of 'union' datasources.

gianm opened a new pull request #10318:
URL: https://github.com/apache/druid/pull/10318


   The problem is that unions are typically rewritten into a series of individual queries on the underlying tables, but this isn't done when the union is wrapped in a join.
   
   The main changes are in UnionQueryRunner:
   
   1) Replace an `instanceof UnionQueryRunner` check with DataSourceAnalysis.
   2) Replace a `query.withDataSource` call with a new function, `Queries.withBaseDataSource`.
   
   Together, these enable UnionQueryRunner to "see through" a join.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #10318: Fix handling of 'join' on top of 'union' datasources.

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #10318:
URL: https://github.com/apache/druid/pull/10318#issuecomment-681048348


   I just pushed a new commit that changes the approach a bit: I reverted back to the old getBaseTableDataSource() behavior, and added a new method getBaseUnionDataSource(). I also added tests for unions of a single table.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #10318: Fix handling of 'join' on top of 'union' datasources.

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on a change in pull request #10318:
URL: https://github.com/apache/druid/pull/10318#discussion_r477104655



##########
File path: processing/src/main/java/org/apache/druid/query/UnionQueryRunner.java
##########
@@ -41,30 +42,27 @@ public UnionQueryRunner(
   public Sequence<T> run(final QueryPlus<T> queryPlus, final ResponseContext responseContext)
   {
     Query<T> query = queryPlus.getQuery();
-    DataSource dataSource = query.getDataSource();
-    if (dataSource instanceof UnionDataSource) {
+
+    final DataSourceAnalysis analysis = DataSourceAnalysis.forDataSource(query.getDataSource());
+
+    if (analysis.isConcreteTableBased() && analysis.getBaseTableDataSources().get().size() != 1) {

Review comment:
       can we use more explicit condition such as `analysis.getBaseDataSource() instanceof UnionDataSource` than this `analysis.getBaseTableDataSources().get().size() != 1`?  May be the assumption around number of tables will not be valid in future. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #10318: Fix handling of 'join' on top of 'union' datasources.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #10318:
URL: https://github.com/apache/druid/pull/10318#discussion_r477502295



##########
File path: processing/src/main/java/org/apache/druid/query/UnionQueryRunner.java
##########
@@ -41,30 +42,27 @@ public UnionQueryRunner(
   public Sequence<T> run(final QueryPlus<T> queryPlus, final ResponseContext responseContext)
   {
     Query<T> query = queryPlus.getQuery();
-    DataSource dataSource = query.getDataSource();
-    if (dataSource instanceof UnionDataSource) {
+
+    final DataSourceAnalysis analysis = DataSourceAnalysis.forDataSource(query.getDataSource());
+
+    if (analysis.isConcreteTableBased() && analysis.getBaseTableDataSources().get().size() != 1) {

Review comment:
       I pushed up a new commit that changes the approach a bit; let me know what you think. The `getBaseTableDataSources()` concept is gone and replaced by two separate methods, `getBaseTableDataSource()` and `getBaseUnionDataSource()`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #10318: Fix handling of 'join' on top of 'union' datasources.

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on a change in pull request #10318:
URL: https://github.com/apache/druid/pull/10318#discussion_r477555940



##########
File path: processing/src/main/java/org/apache/druid/query/UnionQueryRunner.java
##########
@@ -41,30 +42,27 @@ public UnionQueryRunner(
   public Sequence<T> run(final QueryPlus<T> queryPlus, final ResponseContext responseContext)
   {
     Query<T> query = queryPlus.getQuery();
-    DataSource dataSource = query.getDataSource();
-    if (dataSource instanceof UnionDataSource) {
+
+    final DataSourceAnalysis analysis = DataSourceAnalysis.forDataSource(query.getDataSource());
+
+    if (analysis.isConcreteTableBased() && analysis.getBaseTableDataSources().get().size() != 1) {

Review comment:
       Looks good now. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #10318: Fix handling of 'join' on top of 'union' datasources.

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on a change in pull request #10318:
URL: https://github.com/apache/druid/pull/10318#discussion_r477462769



##########
File path: processing/src/main/java/org/apache/druid/query/UnionQueryRunner.java
##########
@@ -41,30 +42,27 @@ public UnionQueryRunner(
   public Sequence<T> run(final QueryPlus<T> queryPlus, final ResponseContext responseContext)
   {
     Query<T> query = queryPlus.getQuery();
-    DataSource dataSource = query.getDataSource();
-    if (dataSource instanceof UnionDataSource) {
+
+    final DataSourceAnalysis analysis = DataSourceAnalysis.forDataSource(query.getDataSource());
+
+    if (analysis.isConcreteTableBased() && analysis.getBaseTableDataSources().get().size() != 1) {

Review comment:
       In that case, It would also be fine to have a static method `isUnionDataSource()` within `DataSourceAnalysis` class itself that can be used here as well as in `dataSourceAnalysis.getBaseTableDataSources()` 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #10318: Fix handling of 'join' on top of 'union' datasources.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #10318:
URL: https://github.com/apache/druid/pull/10318#discussion_r477580012



##########
File path: processing/src/main/java/org/apache/druid/query/UnionQueryRunner.java
##########
@@ -41,34 +44,51 @@ public UnionQueryRunner(
   public Sequence<T> run(final QueryPlus<T> queryPlus, final ResponseContext responseContext)
   {
     Query<T> query = queryPlus.getQuery();
-    DataSource dataSource = query.getDataSource();
-    if (dataSource instanceof UnionDataSource) {
 
-      return new MergeSequence<>(
-          query.getResultOrdering(),
-          Sequences.simple(
-              Lists.transform(
-                  ((UnionDataSource) dataSource).getDataSources(),
-                  new Function<DataSource, Sequence<T>>()
-                  {
-                    @Override
-                    public Sequence<T> apply(DataSource singleSource)
-                    {
-                      return baseRunner.run(
-                          queryPlus.withQuery(
-                              query.withDataSource(singleSource)
-                                   // assign the subqueryId. this will be used to validate that every query servers
-                                   // have responded per subquery in RetryQueryRunner
-                                   .withDefaultSubQueryId()
-                          ),
-                          responseContext
-                      );
-                    }
-                  }
-              )
-          )
-      );
+    final DataSourceAnalysis analysis = DataSourceAnalysis.forDataSource(query.getDataSource());
+
+    if (analysis.isConcreteTableBased() && analysis.getBaseUnionDataSource().isPresent()) {
+      // Union of tables.
+
+      final UnionDataSource unionDataSource = analysis.getBaseUnionDataSource().get();
+
+      if (unionDataSource.getDataSources().isEmpty()) {
+        // Shouldn't happen, because UnionDataSource doesn't allow empty unions.
+        throw new ISE("Unexpectedly received empty union");
+      } else if (unionDataSource.getDataSources().size() == 1) {

Review comment:
       Yeah, the idea is let's not create a MergeSequence if there's only going to be 1 thing.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #10318: Fix handling of 'join' on top of 'union' datasources.

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on a change in pull request #10318:
URL: https://github.com/apache/druid/pull/10318#discussion_r477462769



##########
File path: processing/src/main/java/org/apache/druid/query/UnionQueryRunner.java
##########
@@ -41,30 +42,27 @@ public UnionQueryRunner(
   public Sequence<T> run(final QueryPlus<T> queryPlus, final ResponseContext responseContext)
   {
     Query<T> query = queryPlus.getQuery();
-    DataSource dataSource = query.getDataSource();
-    if (dataSource instanceof UnionDataSource) {
+
+    final DataSourceAnalysis analysis = DataSourceAnalysis.forDataSource(query.getDataSource());
+
+    if (analysis.isConcreteTableBased() && analysis.getBaseTableDataSources().get().size() != 1) {

Review comment:
       In that case, It would also be fine to have a static method `isUnionDataSource(DataSource)` within `DataSourceAnalysis` class itself that can be used here as well as in `dataSourceAnalysis.getBaseTableDataSources()` 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #10318: Fix handling of 'join' on top of 'union' datasources.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #10318:
URL: https://github.com/apache/druid/pull/10318#discussion_r477443081



##########
File path: processing/src/main/java/org/apache/druid/query/UnionQueryRunner.java
##########
@@ -41,30 +42,27 @@ public UnionQueryRunner(
   public Sequence<T> run(final QueryPlus<T> queryPlus, final ResponseContext responseContext)
   {
     Query<T> query = queryPlus.getQuery();
-    DataSource dataSource = query.getDataSource();
-    if (dataSource instanceof UnionDataSource) {
+
+    final DataSourceAnalysis analysis = DataSourceAnalysis.forDataSource(query.getDataSource());
+
+    if (analysis.isConcreteTableBased() && analysis.getBaseTableDataSources().get().size() != 1) {

Review comment:
       Hmm, I was trying to avoid using `instanceof` checks outside of the DataSourceAnalysis class, and instead stick to cleanly defined interfaces as much as possible. So, I designed dataSourceAnalysis.getBaseTableDataSources() to encapsulate this logic (it has the `instanceof` check you're talking about).
   
   Do you think it'd be better to instead move the logic to UnionQueryRunner?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #10318: Fix handling of 'join' on top of 'union' datasources.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #10318:
URL: https://github.com/apache/druid/pull/10318#discussion_r477579754



##########
File path: processing/src/main/java/org/apache/druid/query/UnionQueryRunner.java
##########
@@ -41,30 +42,27 @@ public UnionQueryRunner(
   public Sequence<T> run(final QueryPlus<T> queryPlus, final ResponseContext responseContext)
   {
     Query<T> query = queryPlus.getQuery();
-    DataSource dataSource = query.getDataSource();
-    if (dataSource instanceof UnionDataSource) {
+
+    final DataSourceAnalysis analysis = DataSourceAnalysis.forDataSource(query.getDataSource());
+
+    if (analysis.isConcreteTableBased() && analysis.getBaseTableDataSources().get().size() != 1) {

Review comment:
       Great!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #10318: Fix handling of 'join' on top of 'union' datasources.

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on a change in pull request #10318:
URL: https://github.com/apache/druid/pull/10318#discussion_r477555752



##########
File path: processing/src/main/java/org/apache/druid/query/UnionQueryRunner.java
##########
@@ -41,34 +44,51 @@ public UnionQueryRunner(
   public Sequence<T> run(final QueryPlus<T> queryPlus, final ResponseContext responseContext)
   {
     Query<T> query = queryPlus.getQuery();
-    DataSource dataSource = query.getDataSource();
-    if (dataSource instanceof UnionDataSource) {
 
-      return new MergeSequence<>(
-          query.getResultOrdering(),
-          Sequences.simple(
-              Lists.transform(
-                  ((UnionDataSource) dataSource).getDataSources(),
-                  new Function<DataSource, Sequence<T>>()
-                  {
-                    @Override
-                    public Sequence<T> apply(DataSource singleSource)
-                    {
-                      return baseRunner.run(
-                          queryPlus.withQuery(
-                              query.withDataSource(singleSource)
-                                   // assign the subqueryId. this will be used to validate that every query servers
-                                   // have responded per subquery in RetryQueryRunner
-                                   .withDefaultSubQueryId()
-                          ),
-                          responseContext
-                      );
-                    }
-                  }
-              )
-          )
-      );
+    final DataSourceAnalysis analysis = DataSourceAnalysis.forDataSource(query.getDataSource());
+
+    if (analysis.isConcreteTableBased() && analysis.getBaseUnionDataSource().isPresent()) {
+      // Union of tables.
+
+      final UnionDataSource unionDataSource = analysis.getBaseUnionDataSource().get();
+
+      if (unionDataSource.getDataSources().isEmpty()) {
+        // Shouldn't happen, because UnionDataSource doesn't allow empty unions.
+        throw new ISE("Unexpectedly received empty union");
+      } else if (unionDataSource.getDataSources().size() == 1) {

Review comment:
       is it optimization because of which this `else if` is not folded into `else`? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #10318: Fix handling of 'join' on top of 'union' datasources.

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #10318:
URL: https://github.com/apache/druid/pull/10318#issuecomment-679829791


   > We can get rid of UnionQueryRunner itself with #10030 :)
   
   Ah, that looks interesting. I just commented on it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #10318: Fix handling of 'join' on top of 'union' datasources.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #10318:
URL: https://github.com/apache/druid/pull/10318#discussion_r477485508



##########
File path: processing/src/main/java/org/apache/druid/query/UnionQueryRunner.java
##########
@@ -41,30 +42,27 @@ public UnionQueryRunner(
   public Sequence<T> run(final QueryPlus<T> queryPlus, final ResponseContext responseContext)
   {
     Query<T> query = queryPlus.getQuery();
-    DataSource dataSource = query.getDataSource();
-    if (dataSource instanceof UnionDataSource) {
+
+    final DataSourceAnalysis analysis = DataSourceAnalysis.forDataSource(query.getDataSource());
+
+    if (analysis.isConcreteTableBased() && analysis.getBaseTableDataSources().get().size() != 1) {

Review comment:
       Hmm, in thinking about your comment I realized there is a bug here. The code doesn't work right if you have a UnionDataSource of a single TableDataSource. I'll fix it and add a test for this, and while doing so, I'll consider your comments about how to structure the code.
   
   Thanks for the comments!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] a2l007 commented on pull request #10318: Fix handling of 'join' on top of 'union' datasources.

Posted by GitBox <gi...@apache.org>.
a2l007 commented on pull request #10318:
URL: https://github.com/apache/druid/pull/10318#issuecomment-679335761


   We can get rid of UnionQueryRunner itself with  #10030 :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis merged pull request #10318: Fix handling of 'join' on top of 'union' datasources.

Posted by GitBox <gi...@apache.org>.
clintropolis merged pull request #10318:
URL: https://github.com/apache/druid/pull/10318


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org