You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "MidasLamb (via GitHub)" <gi...@apache.org> on 2023/06/22 12:51:22 UTC

[GitHub] [arrow-datafusion] MidasLamb opened a new issue, #6744: Subsequent `JOIN` s don't match to correct row

MidasLamb opened a new issue, #6744:
URL: https://github.com/apache/arrow-datafusion/issues/6744

   ### Describe the bug
   
   I have a query where I `JOIN` three tables, and then have a WHERE clause on the last table ("twintag.owner"). If I give a value that exists there, all the results are returned (regardless of whether or not the "twintag.owner" matches the actual value), and if I give a value that doesn't exist, it returns nothing.
   
   The query is as follows:
   ```sql
   SELECT "75620d04a8b7b34c70c68f7af68953af".*
   FROM "75620d04a8b7b34c70c68f7af68953af" 
   JOIN "f4be58656cd5e54b9bffa1706c3c49ff" AS "linkedtwintag"
       ON "75620d04a8b7b34c70c68f7af68953af"."82ad7656c6b6be56e9abfac772c89b4c" = "linkedtwintag"."776ff32c7d7bfb1c1a49d5c0d94e0db8" 
   JOIN "__bag" AS "linkedtwintag_twintag"
       ON "linkedtwintag"."dataScope" = "linkedtwintag_twintag"."storage_qid" 
   WHERE 1 = 1 
   AND "linkedtwintag_twintag"."owner" = '324ae4533db15dcb19e3473fea54177a' 
   ORDER BY "75620d04a8b7b34c70c68f7af68953af"."qid" ASC
   ```
   
   If I start from the intermediate table it works as expected and I get back the one result I'm looking for:
   ```sql
   SELECT "f4be58656cd5e54b9bffa1706c3c49ff".*
   FROM "f4be58656cd5e54b9bffa1706c3c49ff" 
   JOIN "__bag" AS "twintag"
       ON "f4be58656cd5e54b9bffa1706c3c49ff"."dataScope" = "twintag"."storage_qid"
   WHERE 1 = 1 
   AND "twintag"."owner" = '324ae4533db15dcb19e3473fea54177a'  ORDER BY "f4be58656cd5e54b9bffa1706c3c49ff"."qid" ASC
   ```
   
   ### To Reproduce
   
   Create some tables where you can construct a JOIN from table A to B and from B to C.
   Execute the (type of) query, where you use the intermediate table to join from table A to C, filtering on something in table C:
   ```sql
   SELECT "75620d04a8b7b34c70c68f7af68953af".*
   FROM "75620d04a8b7b34c70c68f7af68953af" 
   JOIN "f4be58656cd5e54b9bffa1706c3c49ff" AS "linkedtwintag"
       ON "75620d04a8b7b34c70c68f7af68953af"."82ad7656c6b6be56e9abfac772c89b4c" = "linkedtwintag"."776ff32c7d7bfb1c1a49d5c0d94e0db8" 
   JOIN "__bag" AS "linkedtwintag_twintag"
       ON "linkedtwintag"."dataScope" = "linkedtwintag_twintag"."storage_qid" 
   WHERE 1 = 1 
   AND "linkedtwintag_twintag"."owner" = '324ae4533db15dcb19e3473fea54177a' 
   ORDER BY "75620d04a8b7b34c70c68f7af68953af"."qid" ASC
   ```
   
   ### Expected behavior
   
   I expect only the items which match the filter on table C to be returned.
   Currently it returns either ALL items if an item matches, or NO items if there is no match found. In the example above I expect 1 match, but I get all the items from table A back.
   If I change the value I'm looking for in table C to be non-existent there, I get back no results instead, which is expected.
   
   ### Additional context
   
   I'm also using datafusion-remote-table from seafowl (https://github.com/splitgraph/seafowl/tree/main/datafusion_remote_tables)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on issue #6744: Subsequent `JOIN` s don't match to correct row

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on issue #6744:
URL: https://github.com/apache/arrow-datafusion/issues/6744#issuecomment-1609377273

   No worries, this is the entire list:
   ```
           let rules: Vec<Arc<dyn OptimizerRule + Sync + Send>> = vec![
               Arc::new(SimplifyExpressions::new()),
               Arc::new(UnwrapCastInComparison::new()),
               Arc::new(ReplaceDistinctWithAggregate::new()),
               Arc::new(EliminateJoin::new()),
               Arc::new(DecorrelatePredicateSubquery::new()),
               Arc::new(ScalarSubqueryToJoin::new()),
               Arc::new(ExtractEquijoinPredicate::new()),
               // simplify expressions does not simplify expressions in subqueries, so we
               // run it again after running the optimizations that potentially converted
               // subqueries to joins
               Arc::new(SimplifyExpressions::new()),
               Arc::new(MergeProjection::new()),
               Arc::new(RewriteDisjunctivePredicate::new()),
               Arc::new(EliminateDuplicatedExpr::new()),
               Arc::new(EliminateFilter::new()),
               Arc::new(EliminateCrossJoin::new()),
               Arc::new(CommonSubexprEliminate::new()),
               Arc::new(EliminateLimit::new()),
               Arc::new(PropagateEmptyRelation::new()),
               Arc::new(FilterNullJoinKeys::default()),
               Arc::new(EliminateOuterJoin::new()),
               // Filters can't be pushed down past Limits, we should do PushDownFilter after PushDownLimit
               Arc::new(PushDownLimit::new()),
               Arc::new(PushDownFilter::new()),
               Arc::new(SingleDistinctToGroupBy::new()),
               // The previous optimizations added expressions and projections,
               // that might benefit from the following rules
               Arc::new(SimplifyExpressions::new()),
               Arc::new(UnwrapCastInComparison::new()),
               Arc::new(CommonSubexprEliminate::new()),
               Arc::new(PushDownProjection::new()),
               Arc::new(EliminateProjection::new()),
               // PushDownProjection can pushdown Projections through Limits, do PushDownLimit again.
               Arc::new(PushDownLimit::new()),
           ];
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] MidasLamb commented on issue #6744: Subsequent `JOIN` s don't match to correct row

Posted by "MidasLamb (via GitHub)" <gi...@apache.org>.
MidasLamb commented on issue #6744:
URL: https://github.com/apache/arrow-datafusion/issues/6744#issuecomment-1609421942

   It seems to replace an `Inner Join` with a `Cross Join` for some reason, which (in my particular case) isn't actually correct


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on issue #6744: Subsequent `JOIN` s don't match to correct row

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on issue #6744:
URL: https://github.com/apache/arrow-datafusion/issues/6744#issuecomment-1609336228

   Awesome - @MidasLamb . If possible, could you identify:
   
   * Which optimizer rule is at fault?
   * What is the change in logical / physical plan (EXPLAIN VERBOSE)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] MidasLamb commented on issue #6744: Subsequent `JOIN` s don't match to correct row

Posted by "MidasLamb (via GitHub)" <gi...@apache.org>.
MidasLamb commented on issue #6744:
URL: https://github.com/apache/arrow-datafusion/issues/6744#issuecomment-1608980072

   @alamb , I've created a reproduction repo here: https://github.com/MidasLamb/datafusion-multiple-join-bug-example
   
   If you search for "b93" in the output that is shown, you'll see in the first test that when the `WHERE` clause is added, there suddenly are a lot more results which show up out of nowhere compared to the first `show()` without a `WHERE` clause


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] MidasLamb commented on issue #6744: Subsequent `JOIN` s don't match to correct row

Posted by "MidasLamb (via GitHub)" <gi...@apache.org>.
MidasLamb commented on issue #6744:
URL: https://github.com/apache/arrow-datafusion/issues/6744#issuecomment-1609341934

   @Dandandan , I'm trying to find the default optimizer rules so I can start weeding them out, but I can't find them immediately, so that's why I already posted that little update


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] MidasLamb commented on issue #6744: Subsequent `JOIN` s don't match to correct row

Posted by "MidasLamb (via GitHub)" <gi...@apache.org>.
MidasLamb commented on issue #6744:
URL: https://github.com/apache/arrow-datafusion/issues/6744#issuecomment-1609521816

   I've found this piece of comment/code in the `EliminateCrossJoin` optimizer:
   ```rust
   // The filter of inner join will lost, skip this rule.
   // issue: https://github.com/apache/arrow-datafusion/issues/4844
   if join.filter.is_some() {
       return Ok(None);
   }
   ```
   However in my scenario the filtered join is not the top level, so that checks seems to get skipped.
   
   I've set up a test for this by doing this:
   ```rust
           let t1 = test_table_scan_with_name("t1")?;
           let t2 = test_table_scan_with_name("t2")?;
           let t3 = test_table_scan_with_name("t3")?;
   
           // could eliminate to inner join since filter has Join predicates
           let plan = LogicalPlanBuilder::from(t1)
               .join(t2, JoinType::Inner, (Vec::<Column>::new(), Vec::<Column>::new()), Some(col("t1.a").eq(col("t2.a"))))?
               .join(
                   t3,
                   JoinType::Inner,
                   (
                       vec![Column::from_qualified_name("t2.a")],
                       vec![Column::from_qualified_name("t3.a")],
                   ),
                   None,
               )?
               .filter(col("t2.c").lt(lit(20u32)))?
               .build()?;
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Subsequent `JOIN` s don't match to correct row [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #6744: Subsequent `JOIN` s don't match to correct row
URL: https://github.com/apache/arrow-datafusion/issues/6744


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] MidasLamb commented on issue #6744: Subsequent `JOIN` s don't match to correct row

Posted by "MidasLamb (via GitHub)" <gi...@apache.org>.
MidasLamb commented on issue #6744:
URL: https://github.com/apache/arrow-datafusion/issues/6744#issuecomment-1609323620

   It seems to be a bug in a default optimizer, because adding this:
   ```rust
       let ctx = SessionContext::with_state(
           state
               .with_optimizer_rules(vec![])
       );
   
   ```
   gives back the results that I expect


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #6744: Subsequent `JOIN` s don't match to correct row

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6744:
URL: https://github.com/apache/arrow-datafusion/issues/6744#issuecomment-1604577124

   Thank you for the report @MidasLamb  -- is it possible to share a reproducer (aka the data referred to above)? This seems like it may be data specific


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org