You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/03 05:35:40 UTC

[GitHub] [arrow-datafusion] Dandandan opened a new issue #488: Map in to semi join

Dandandan opened a new issue #488:
URL: https://github.com/apache/arrow-datafusion/issues/488


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Currently `IN` in queries is not supported.
   The recently added semi hash join can be used for those queries.
   
   **Describe the solution you'd like**
   Add this to planner/optimizer.
   
   **Describe alternatives you've considered**
   
   **Additional context**
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #488: Map `IN` to semi join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #488:
URL: https://github.com/apache/arrow-datafusion/issues/488#issuecomment-857405281


   Hey @msathis that would be great.
   
   Effectively it means rewriting queries from:
   
   ```
   SELECT a, b
   FROM 
   x
   WHERE a in (select b from t)
   ``` 
   
   Could be written as (minus SQL syntax)
   
   ```
   SELECT a, b
   FROM
   x
   SEMI JOIN t ON a=b
   ```
   
   So the work will be
   
   * adding `IN` as option to the `Expr` enum and adding it to the `sql/planner`.
   * extracting applicable `IN` expression and transforming it to (left and right) columns
   * converting it to a semi join (a join with `JoinType::Semi`) either directly in the planner, and/or add a optimization rule (e.g. translating a cross join to a semi join). the first would be fine for now.
   
   I think we can return an error in case the logical plan still contains a `IN` in a expression somewhere.
   
   One complication I saw is that adding a `LogicalPlan` to the `Expr` (for encoding `IN`) is not trivial, because `Expr` has some derived `Eq` etc. which the logical plan does not have.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #488: Map `IN` to semi join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #488:
URL: https://github.com/apache/arrow-datafusion/issues/488#issuecomment-857586240


   > @Dandandan I was debugging to see what happens with the current code. Seems like all `IN` clauses are converted to `Expr::InList`. Should we use the `InList` itself? Or it is intended for different purpose, so better create `Expr::In`?
   
   
   The `InList` is for the `IN (1, 2, 3)` (a literal list) syntax, whereas this will be for `IN (SELECT ....)` (in subquery).
   
   This currently results in an error if you try it (something like `Unsupported ast node SQLExpr::InSubQuery) in sqltorel`) which is defined here: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/sql/planner.rs#L1205


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] msathis commented on issue #488: Map `IN` to semi join

Posted by GitBox <gi...@apache.org>.
msathis commented on issue #488:
URL: https://github.com/apache/arrow-datafusion/issues/488#issuecomment-857529666


   @Dandandan I was debugging to see what happens with the current code. Seems like all `IN` clauses are converted to `Expr::InList`. Should we use the `InList` itself? Or it is intended for different purpose, so better create `Expr::In`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] msathis edited a comment on issue #488: Map `IN` to semi join

Posted by GitBox <gi...@apache.org>.
msathis edited a comment on issue #488:
URL: https://github.com/apache/arrow-datafusion/issues/488#issuecomment-857597900


   @Dandandan Got it. Earlier I was trying with literal list syntax. Something like
   
   ```
   #[tokio::test]
   async fn test_in_expression() -> Result<()> {
       let mut ctx = create_ctx()?;
       register_aggregate_simple_csv(&mut ctx)?;
       let sql = "SELECT COUNT(1) from aggregate_simple where c3 IN('true')";
       let actual = execute(&mut ctx, sql).await;
       let r1 = actual[0][0].parse::<String>().unwrap();
       assert_eq!("true", r1);
       Ok(())
   }
   ```
   
    Even `InList` doesn't seem to support `Utf8` data types. So i wrongly assumed it was the issue to fix. Not it's all clear 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] msathis edited a comment on issue #488: Map `IN` to semi join

Posted by GitBox <gi...@apache.org>.
msathis edited a comment on issue #488:
URL: https://github.com/apache/arrow-datafusion/issues/488#issuecomment-857597900






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] msathis commented on issue #488: Map `IN` to semi join

Posted by GitBox <gi...@apache.org>.
msathis commented on issue #488:
URL: https://github.com/apache/arrow-datafusion/issues/488#issuecomment-857597900


   @Dandandan Got it. Earlier I was trying with literal list syntax. Something like
   
   `#[tokio::test]
   async fn test_in_expression() -> Result<()> {
       let mut ctx = create_ctx()?;
       register_aggregate_simple_csv(&mut ctx)?;
       let sql = "SELECT COUNT(1) from aggregate_simple where c3 IN('true')";
       let actual = execute(&mut ctx, sql).await;
       let r1 = actual[0][0].parse::<String>().unwrap();
       assert_eq!("true", r1);
       Ok(())
   }`
   
    Even `InList` doesn't seem to support `Utf8` data types. So i wrongly assumed it was the issue to fix. Not it's all clear 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] msathis commented on issue #488: Map `IN` to semi join

Posted by GitBox <gi...@apache.org>.
msathis commented on issue #488:
URL: https://github.com/apache/arrow-datafusion/issues/488#issuecomment-857393540


   @Dandandan I would like to work on this ticket. Some pointers would be great, so i will be in the right direction. 😎


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org