You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/08 15:42:14 UTC

[GitHub] [arrow-datafusion] jackwener opened a new issue, #2181: [Discuss] Add struct `Query` for datafusion

jackwener opened a new issue, #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   I'm try to implement `InSubquery`.
   
   I found that there isn't `Query` struct in datafusion. I think that it's because there isn't subquery handle in datafusion.
   
   In the roadmap, datafusion include `subquery` implementation. I think `Query` struct is necessary just like `datafusion-expr`.
   
   **Describe the solution you'd like**
   
   **Describe alternatives you've considered**
   
   **Additional context**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1100687565

   I have started a design document:
   
   https://docs.google.com/document/d/1j5vHyva-T_5l3POnHSS0-r5TOp86sEkfSUcRLqJZaK4/edit?usp=sharing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1100139275

   See https://github.com/dask-contrib/dask-sql/issues/474 and its subtasks such as https://github.com/apache/arrow-datafusion/issues/2238


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
jackwener commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1093051727

   I guess that `reason` why use `datafusion-expr` instead of `sqlparser` is `sqlparser` need to adapt to various sql types and a little complex.
   
   Maybe we also need to use customized `Query` and other.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1093135208

   @jackwener  I am not sure about the need for `Query` -- if there is one, perhaps we can combine it with what @andygrove  is working on in https://github.com/apache/arrow-datafusion/pull/2172
   
   I was imagining that a query like this
   
   ```sql
   SELECT f.x, sq.a FROM foo as f JOIN (select a from bar) as sq ON (f.x = sq.a),
   ```
   
   results in a logical plan like this:
   ```
   Projection(f.x, sq.a)
     --> Join(Inner f.x = sq.a)
       --> TableScan(foo)
       --> Projection(a)  <--- Here is where the "subquery"'s LogicalPlan is attached 
         --> TableScan(bar)
   ```
   
   Similarly for
   ```sql
   SELECT f.x  FROM foo as f WHERE f.x IN  (select a from bar) as sq;
   ```
   
   results in a logical plan like this:
   ```
   Projection(f.x, sq.a)
     --> Join(Semi f.x = sq.a)
       --> TableScan(foo)
       --> Projection(a)  <--- Here is where the "subquery"'s LogicalPlan is attached 
         --> TableScan(bar)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1093136523

   See https://github.com/apache/arrow-datafusion/issues/1209 for some more thinking on this matter


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] xudong963 commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1093699398

   Here is a framework https://github.com/apache/arrow-datafusion/pull/1649 for implementing subqueries that I wrote some time ago. After this framework is completed, I believe that many kinds of subqueries can be easily added, such as `in`, `scalar subquery`, and this framework takes decorrelated into account at the beginning.
   
   Currently this ticket encounters some problems because datafusion currently records columns by name, not by index.
   
   Since I've been busy with changing jobs and adjusting to a new life, I haven't had much time to move forward with this ticket, so it's a bit delayed :(


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1100453335

   > that being said, I think we can get very far in DataFusion (maybe forever) by handling only uncorrelated and equi-join correlated subqueries and avoid LogicalPlan in Expr
   
   For dask-sql, I have a requirement to support a popular decision support SQL benchmark that has correlated subqueries that are not all equi-join. Even if we were just looking to support equi-join, I would expect to build a simple logical plan first and then have an opti,mizer rule re-write the plan to a SEMI-JOIN or ANTI-JOIN. I don't think it would be feasible to try and do this while translating a SQL AST to a logical plan before the entire query is built.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1103165557

   I plan to read the design document carefully tomorrow morning and hopefully then I can have a useful conversation about this topic. Thank you @andygrove for driving it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1110949387

   I think this issue can be closed now that we have the following logical plan structures:
   
   - Expr::Exists
   - Expr::InSubquery
   - Expr::ScalarSubquery
   - LogicalPlan::Subquery
   - LogicalPlan::SubqueryAlias


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove closed issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
andygrove closed issue #2181: [Discuss] Add struct `Query` for datafusion
URL: https://github.com/apache/arrow-datafusion/issues/2181


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
jackwener commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1093020933

   @alamb @yahoNanJing @yjshen @mingmwang @tustvold How do you think about it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] xudong963 commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1093876415

   > I think it's not good choice to embed a `logicplan` to Expr.
   
   Why
   
   > I think it's not good choice to embed a `logicplan` to Expr.
   
   why


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1100450495

   Perhaps @Jimexist can offer an opinion on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1100448848

   So I started prototyping and ran into an issue right away. As others have pointed out, we sometimes need to have an `Expr` that refers to a `LogicalPlan`, such as:
   
   ``` rust
   enum Expr {
     Exists {
       subquery: Box<LogicalPlan>
     }
     ...
   }
   ```
   
   However, the LogicalPlan enum is in the `datafusion` crate which depends on the `datafusion-expr` crate so I cannot reference `LogicalPlan` from `Expr`. 
   
   I was not paying much attention when this refactoring happened. Does it make sense to now move `LogicalPlan` as well to `datafusion-expr` crate?
   
   
       


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] xudong963 commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1096620787

   Subquery processing is a complex matter, and I highly recommend that we figure out how to do it at the beginning (how to handle associative and non-associative), otherwise there will be a lot of bugs, and hard to cover areas later. 
   
   I recommend reading this article by materialize: https://materialize.com/decorrelation-subquery-optimization/, which mentions three articles, especially the tum one, which are very informative.
   
   My previous framework was implemented according to this, and I will close it first. I'll continue when I'm free later, if others haven't done complete support :). 
   
   If anyone else has a better idea, feel free to ping me for a review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
jackwener commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1096604556

   > Correlated queries that use something other than = (non equijoins) eventually have to be run once per each row in the outer query and would likely be expressed as a LogicalPlan in an Expr.
   
   Maybe, we can use `Query`  in an `Expr`.
   
   Because the progress is 
   
   |   AST and binder  -->  Logical plan  --> Execute plan   |
   
   Query is in `AST and binder` and will transfer to `LogicalPlan`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1096587791

   There are two kinds of subqueries
   * Uncorrelated (aka subqueries that appear in the from clause that don't have references to the outer query)
   * Correlated (subqueries that appear in the from clause or IN / NOT IN /etc and that refer to the outer query)
   
   For correlated queries that that refer to the outer query using `=` predicates they can be transformed into SEMI-JOIN / ANTI-JOIN typically
   
   Correlated queries that use something other than `=` (non equijoins) eventually have to be run once per each row in the outer query and would likely be expressed as a `LogicalPlan` in an `Expr`. 
   
   that being said, I think we can get very far in DataFusion (maybe forever) by handling only uncorrelated and equi-join correlated subqueries and avoid LogicalPlan in Expr


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1094490524

   > 
   
   
   
   > > Here is a framework #1649 for implementing subqueries that I wrote some time ago. After this framework is completed, I believe that many kinds of subqueries can be easily added, such as in, scalar subquery, and this framework takes decorrelated into account at the beginning.
   > 
   > I also see this PR. I think it's not good choice to embed a `logicplan` to Expr.
   
   I also have the same doubt.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
jackwener commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1093833121

   > Here is a framework https://github.com/apache/arrow-datafusion/pull/1649 for implementing subqueries that I wrote some time ago. After this framework is completed, I believe that many kinds of subqueries can be easily added, such as in, scalar subquery, and this framework takes decorrelated into account at the beginning.
   
   I also see this PR. I think it's not good choice to embed a `logicplan` to Expr.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1099673457

   I just wanted to add a note here to say that I have dedicated time now to help with this effort. I think we can break this work down into phases and the logical starting point (pun intended) would be to decide on how we want to represent the various types of subquery in the LogicalPlan. Once we have that we can create separate issues and PRs for SQL query planning, DataFrame API, logical plan optimizations, and implementing the physical plans.
   
   I am going to start by reviewing Apache Spark's logical plan for subqueries and then will put a proposal up. I'm not sure if this will be in the form of a PR or a Google document yet but I will share the proposal here, probably early next week.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1101719864

   > I have a requirement to support a popular decision support SQL benchmark that has correlated subqueries that are not all equi-join. 
   
   What is the query? Does it have equi-join correlation as well as some non equijoins?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] paveltiunov commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
paveltiunov commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1103083741

   @andygrove @alamb I guess great timing! We just implemented the correlated subquery here https://github.com/cube-js/arrow-datafusion/commit/d14f0de4c042c9a14b40acd4768c6e4f6585ee93. If you're open I can bring it as a PR and would love to shape it to a more canonical structure. There're multiple caveats some of which are covered in the design document like circular dependency between `Expr` and `LogicalPlan`. We'd love to see correlated subquery support merged as we don't have incentives to support such a big change in our fork.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on issue #2181: [Discuss] Add struct `Query` for datafusion

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #2181:
URL: https://github.com/apache/arrow-datafusion/issues/2181#issuecomment-1105303702

   > @andygrove @alamb I guess great timing! We just implemented the correlated subquery here [cube-js@d14f0de](https://github.com/cube-js/arrow-datafusion/commit/d14f0de4c042c9a14b40acd4768c6e4f6585ee93). If you're open I can bring it as a PR and would love to shape it to a more canonical structure. There're multiple caveats some of which are covered in the design document like circular dependency between `Expr` and `LogicalPlan`. We'd love to see correlated subquery support merged as we don't have incentives to support such a big change in our fork.
   
   Hi @paveltiunov Yes I would love to see a PR to add subquery support. The circular dependency issue is resolved in https://github.com/apache/arrow-datafusion/pull/2294 so that should help. I will start reviewing your implementation now to get up to speed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org