
Posted to github@arrow.apache.org by "jiangzhx (via GitHub)" <gi...@apache.org> on 2023/03/29 06:28:33 UTC

[GitHub] [arrow-datafusion] jiangzhx opened a new issue, #5771: use dataframe to filter with exists subquery,TableScan node added a unexpected projection

jiangzhx opened a new issue, #5771:
URL: https://github.com/apache/arrow-datafusion/issues/5771

   ### Describe the bug
   
   When a DataFrame is filtered with an EXISTS subquery, the TableScan node inside the subquery gets an unexpected projection (`projection=[a]` on `t2` in the plan below).
   
           "+--------------+-------------------------------------------------------+",
           "| plan_type    | plan                                                  |",
           "+--------------+-------------------------------------------------------+",
           "| logical_plan | Filter: EXISTS (<subquery>)                           |",
           "|              |   Subquery:                                           |",
           "|              |     Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]] |",
           "|              |       TableScan: t2 projection=[a]                    |",
           "|              |   TableScan: t1 projection=[a, b]                     |",
           "+--------------+-------------------------------------------------------+",
   
   
   ### To Reproduce
   
   ```
   
   #[tokio::test]
   async fn test_count_wildcard_on_where_exist() -> Result<()> {
       let ctx = create_join_context()?;
   
       let df_results = ctx
           .table("t1")
           .await?
           .filter(Expr::Exists {
               subquery: Subquery {
                   subquery: Arc::new(
                       ctx.table("t2")
                           .await?
                           .aggregate(vec![], vec![count(Expr::Wildcard)])?
                           .select(vec![count(Expr::Wildcard)])?
                           .into_optimized_plan()?,
                   ),
                   outer_ref_columns: vec![],
               },
               negated: false,
           })?
           .select(vec![col("a"), col("b")])?
           .explain(false, false)?
           .collect()
           .await?;
       #[rustfmt::skip]
       let expected = vec![
           "+--------------+-------------------------------------------------------+",
           "| plan_type    | plan                                                  |",
           "+--------------+-------------------------------------------------------+",
           "| logical_plan | Filter: EXISTS (<subquery>)                           |",
           "|              |   Subquery:                                           |",
           "|              |     Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]] |",
           "|              |       TableScan: t2 projection=[a]                    |",
           "|              |   TableScan: t1 projection=[a, b]                     |",
           "+--------------+-------------------------------------------------------+",
       ];
       assert_batches_eq!(expected, &df_results);
       Ok(())
   }
   
   ```
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jiangzhx commented on issue #5771: use dataframe to filter with exists subquery,TableScan node added a unexpected projection

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.

jiangzhx commented on issue #5771:
URL: https://github.com/apache/arrow-datafusion/issues/5771#issuecomment-1488257740

   > I think there are 2 issues here.
   > 
   > 1. In the SQL case, the subquery plan inside the subquery expression does not get a chance to run all the optimizer rules;
   >    this is a known issue.
   > 2. In the DataFrame case, after the push_down_projection rule is applied, the subquery plan gains an unnecessary projection.
   
   Thank you for confirming my idea. I will try to fix the unexpected projection added by push_down_projection.



[GitHub] [arrow-datafusion] jiangzhx commented on issue #5771: use dataframe to filter with exists subquery,TableScan node added a unexpected projection

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.

jiangzhx commented on issue #5771:
URL: https://github.com/apache/arrow-datafusion/issues/5771#issuecomment-1488227022

   After the push_down_projection optimization, the subquery plan is:
   
   Projection: COUNT(*)
     Aggregate: groupBy=[[]], aggr=[[COUNT(*)]]
       TableScan: t2 projection=[a]
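
To make the question concrete, here is a minimal sketch of a projection pushdown pass over a toy plan. It is not DataFusion's `push_down_projection` rule; `Plan`, `EmptyPolicy`, `push_down` and `count_star_projection` are hypothetical names invented for illustration. The sketch only shows the choice such a pass faces when `COUNT(*)` references no columns: keep one column on the scan, which yields a plan shaped like `projection=[a]`, or read no columns at all. Whether DataFusion makes the first choice for this reason is an assumption.

```rust
use std::collections::BTreeSet;

/// Toy logical plan (hypothetical; NOT DataFusion's `LogicalPlan`).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Table scan: the full schema plus the subset of columns to read.
    Scan { schema: Vec<String>, projection: Vec<String> },
    /// Global aggregate; `args` lists the columns its expressions reference.
    /// COUNT(*) references no column, so `args` is empty for it.
    Aggregate { args: Vec<String>, input: Box<Plan> },
}

/// What the pass does when nothing above a scan needs any of its columns.
#[derive(Debug, Clone, Copy)]
enum EmptyPolicy {
    /// Keep the first schema column so the scan still reads something.
    FirstColumn,
    /// Read no columns; the scan only has to report how many rows exist.
    NoColumns,
}

/// Push the set of required columns down to the scan.
fn push_down(plan: Plan, required: &BTreeSet<String>, policy: EmptyPolicy) -> Plan {
    match plan {
        Plan::Aggregate { args, input } => {
            // The aggregate's output hides its input columns, so only the
            // columns referenced by the aggregate itself are required below.
            let needed: BTreeSet<String> = args.iter().cloned().collect();
            Plan::Aggregate { args, input: Box::new(push_down(*input, &needed, policy)) }
        }
        Plan::Scan { schema, .. } => {
            let mut projection: Vec<String> =
                schema.iter().filter(|c| required.contains(*c)).cloned().collect();
            if projection.is_empty() {
                match policy {
                    EmptyPolicy::FirstColumn => projection.extend(schema.first().cloned()),
                    EmptyPolicy::NoColumns => {}
                }
            }
            Plan::Scan { schema, projection }
        }
    }
}

/// Return the projection of the scan at the bottom of the plan.
fn scan_projection(plan: &Plan) -> &[String] {
    match plan {
        Plan::Scan { projection, .. } => projection.as_slice(),
        Plan::Aggregate { input, .. } => scan_projection(input),
    }
}

/// Build `Aggregate(COUNT(*)) -> Scan(t2: a, b)`, optimize it, and report
/// which columns the scan ends up reading.
fn count_star_projection(policy: EmptyPolicy) -> Vec<String> {
    let schema = vec!["a".to_string(), "b".to_string()];
    let scan = Plan::Scan { schema: schema.clone(), projection: schema };
    let plan = Plan::Aggregate { args: vec![], input: Box::new(scan) };
    let optimized = push_down(plan, &BTreeSet::new(), policy);
    scan_projection(&optimized).to_vec()
}
```

Calling `count_star_projection(EmptyPolicy::FirstColumn)` gives a scan that reads only `a`, while `EmptyPolicy::NoColumns` gives an empty projection.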
   



[GitHub] [arrow-datafusion] mingmwang commented on issue #5771: use dataframe to filter with exists subquery,TableScan node added a unexpected projection

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.

mingmwang commented on issue #5771:
URL: https://github.com/apache/arrow-datafusion/issues/5771#issuecomment-1488253981

   I think there are 2 issues here.
   1. In the SQL case, the subquery plan inside the subquery expression does not get a chance to run all the optimizer rules;
   this is a known issue.
   2. In the DataFrame case, after the push_down_projection rule is applied, the subquery plan gains an unnecessary projection.



[GitHub] [arrow-datafusion] mingmwang commented on issue #5771: use dataframe to filter with exists subquery,TableScan node added a unexpected projection

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.

mingmwang commented on issue #5771:
URL: https://github.com/apache/arrow-datafusion/issues/5771#issuecomment-1488567638

   Well, you are right, and this is not a bug, but I think there is room for improvement: if the backend is Parquet files, we can just read the Parquet file footers instead of reading any real fields.
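
The footer idea can be sketched with a toy model. Parquet footers record a row count for every row group, so a bare `COUNT(*)` can in principle be answered by summing those counts without decoding any column data. The types below (`RowGroupMeta`, `FileFooter`) and the function `count_rows_from_footers` are hypothetical stand-ins, not the `parquet` crate's API and not how DataFusion implements anything.

```rust
/// Toy stand-in for one row group's entry in a Parquet footer (hypothetical type).
struct RowGroupMeta {
    /// Number of rows in this row group, as recorded in the footer.
    num_rows: u64,
}

/// Toy stand-in for the footer metadata of one Parquet file (hypothetical type).
struct FileFooter {
    row_groups: Vec<RowGroupMeta>,
}

/// Answer COUNT(*) over a set of files from footer metadata alone:
/// sum the per-row-group row counts and never touch column data.
fn count_rows_from_footers(footers: &[FileFooter]) -> u64 {
    footers
        .iter()
        .flat_map(|footer| footer.row_groups.iter())
        .map(|row_group| row_group.num_rows)
        .sum()
}
```

This shortcut only applies when no filter has to be evaluated against the rows; with a predicate, at least the filtered columns would still have to be read.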

