You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/26 07:46:32 UTC

[GitHub] [arrow-datafusion] seddonm1 opened a new pull request #778: JOIN conditions are order dependent

seddonm1 opened a new pull request #778:
URL: https://github.com/apache/arrow-datafusion/pull/778


   # Which issue does this PR close?
   
   Closes #777.
   
    # Rationale for this change
   Currently the Logical Plan builder assumes that the join conditions fields are provided in the same order as the tables are specified. This is different behavior to how Postgres works which does not care about the order of the tables listed in the condition. This PR aims to rectify this discrepancy.
   
   # What changes are included in this PR?
   This change allows the conditions to be reversed and will fail as previously if neither relation contains the column. A lot of the tests in `sql.rs` have been updated to test both conditions.
   
   # Are there any user-facing changes?
   This does not change public APIs, and should not be noticeable by any current users.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan merged pull request #778: JOIN conditions are order dependent

Posted by GitBox <gi...@apache.org>.
Dandandan merged pull request #778:
URL: https://github.com/apache/arrow-datafusion/pull/778


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on a change in pull request #778: JOIN conditions are order dependent

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #778:
URL: https://github.com/apache/arrow-datafusion/pull/778#discussion_r677152935



##########
File path: datafusion/src/logical_plan/builder.rs
##########
@@ -282,14 +282,28 @@ impl LogicalPlanBuilder {
             ));
         }
 
-        let left_keys: Vec<Column> = left_keys
-            .into_iter()
-            .map(|c| c.into().normalize(&self.plan))
-            .collect::<Result<_>>()?;
-        let right_keys: Vec<Column> = right_keys
-            .into_iter()
-            .map(|c| c.into().normalize(right))
-            .collect::<Result<_>>()?;
+        let (left_keys, right_keys): (Vec<Result<Column>>, Vec<Result<Column>>) =

Review comment:
       Yeah, I think the argument name makes sense with the old assumption that left_keys should only refer to columns from the left plan. With this fix, it can become confusing.
   
   Perhaps a better signature for this function would be something along the lines of: `join_keys: Vec<(impl Into<Column>, impl Into<Column>)>`.
   
   If we want the dataframe API to also support order independent JOIN conditions, then it is indeed better to address the fix within this builder method. Otherwise I agree it would be better to fix it in the SQL planner instead. I am leaning towards supporting this in dataframe API as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #778: JOIN conditions are order dependent

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #778:
URL: https://github.com/apache/arrow-datafusion/pull/778#discussion_r676827083



##########
File path: datafusion/src/logical_plan/builder.rs
##########
@@ -282,14 +282,42 @@ impl LogicalPlanBuilder {
             ));
         }
 
+        let left_keys = left_keys
+            .into_iter()
+            .map(|c| c.into())
+            .collect::<Vec<Column>>();
+
+        let mut swap = false;
         let left_keys: Vec<Column> = left_keys
+            .clone()
             .into_iter()
-            .map(|c| c.into().normalize(&self.plan))
-            .collect::<Result<_>>()?;
+            .map(|c| c.normalize(&self.plan))
+            .collect::<Result<_>>()
+            .or_else(|_| {
+                swap = true;

Review comment:
       Shouldn't "swapping" be done on an individual level, e.g. for `left1=right1 and right2=left2` it will swap everything now?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] seddonm1 commented on a change in pull request #778: JOIN conditions are order dependent

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #778:
URL: https://github.com/apache/arrow-datafusion/pull/778#discussion_r676947049



##########
File path: datafusion/src/logical_plan/builder.rs
##########
@@ -282,14 +282,42 @@ impl LogicalPlanBuilder {
             ));
         }
 
+        let left_keys = left_keys
+            .into_iter()
+            .map(|c| c.into())
+            .collect::<Vec<Column>>();
+
+        let mut swap = false;
         let left_keys: Vec<Column> = left_keys
+            .clone()
             .into_iter()
-            .map(|c| c.into().normalize(&self.plan))
-            .collect::<Result<_>>()?;
+            .map(|c| c.normalize(&self.plan))
+            .collect::<Result<_>>()
+            .or_else(|_| {
+                swap = true;

Review comment:
       Yes, thanks for you review. I will address this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] seddonm1 commented on a change in pull request #778: JOIN conditions are order dependent

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #778:
URL: https://github.com/apache/arrow-datafusion/pull/778#discussion_r677031075



##########
File path: datafusion/tests/sql.rs
##########
@@ -1754,51 +1779,70 @@ async fn equijoin_and_unsupported_condition() -> Result<()> {
 #[tokio::test]
 async fn left_join() -> Result<()> {
     let mut ctx = create_join_context("t1_id", "t2_id")?;
-    let sql = "SELECT t1_id, t1_name, t2_name FROM t1 LEFT JOIN t2 ON t1_id = t2_id ORDER BY t1_id";
-    let actual = execute(&mut ctx, sql).await;
+    let equivalent_sql = [

Review comment:
       My main objective was to ensure nothing specific to equijoin (`INNER`) would break this implementation for the other join types (which occurred when developing it). This may be more relevant if any further optimisations occur.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] seddonm1 commented on a change in pull request #778: JOIN conditions are order dependent

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #778:
URL: https://github.com/apache/arrow-datafusion/pull/778#discussion_r677028886



##########
File path: datafusion/src/logical_plan/builder.rs
##########
@@ -282,14 +282,28 @@ impl LogicalPlanBuilder {
             ));
         }
 
-        let left_keys: Vec<Column> = left_keys
-            .into_iter()
-            .map(|c| c.into().normalize(&self.plan))
-            .collect::<Result<_>>()?;
-        let right_keys: Vec<Column> = right_keys
-            .into_iter()
-            .map(|c| c.into().normalize(right))
-            .collect::<Result<_>>()?;
+        let (left_keys, right_keys): (Vec<Result<Column>>, Vec<Result<Column>>) =

Review comment:
       > Welcome back @seddonm1 ! It is great to see a contribution from you. Thank you very much
   
   Thanks @alamb. Unfortunately my employment situation makes it hard to contribute too much but I am still very interested in seeing Datafusion succced.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #778: JOIN conditions are order dependent

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #778:
URL: https://github.com/apache/arrow-datafusion/pull/778#discussion_r677019621



##########
File path: datafusion/src/logical_plan/builder.rs
##########
@@ -282,14 +282,28 @@ impl LogicalPlanBuilder {
             ));
         }
 
-        let left_keys: Vec<Column> = left_keys
-            .into_iter()
-            .map(|c| c.into().normalize(&self.plan))
-            .collect::<Result<_>>()?;
-        let right_keys: Vec<Column> = right_keys
-            .into_iter()
-            .map(|c| c.into().normalize(right))
-            .collect::<Result<_>>()?;
+        let (left_keys, right_keys): (Vec<Result<Column>>, Vec<Result<Column>>) =

Review comment:
       I don't fully understand this code -- if `LogicalPlanBuilder::join` is called it has two inputs, a left (`self`) and a right (`right`) so the idea that `left_keys` might contain references to `right` is confusing to me.
   
   I wonder if there might be a better place to fix up the errors -- perhaps  somewhere further up in the call stack thatn  `LogicalPlanBuilder::join` -- for example `extract_join_keys` in the SQL planner (especially since this bug seems related to SQL planning)
   
   https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/sql/planner.rs#L372-L380

##########
File path: datafusion/tests/sql.rs
##########
@@ -1754,51 +1779,70 @@ async fn equijoin_and_unsupported_condition() -> Result<()> {
 #[tokio::test]
 async fn left_join() -> Result<()> {
     let mut ctx = create_join_context("t1_id", "t2_id")?;
-    let sql = "SELECT t1_id, t1_name, t2_name FROM t1 LEFT JOIN t2 ON t1_id = t2_id ORDER BY t1_id";
-    let actual = execute(&mut ctx, sql).await;
+    let equivalent_sql = [

Review comment:
       I wonder if this test case and those below it add any additional coverage compared to 
   
   ```rust
   async fn equijoin_multiple_condition_ordering() -> Result<()> {
   ```
   
   Like are there bugs that would cause `equijoin_multiple_condition_ordering` to pass but the others to fail?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #778: JOIN conditions are order dependent

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #778:
URL: https://github.com/apache/arrow-datafusion/pull/778#discussion_r677361196



##########
File path: datafusion/src/logical_plan/builder.rs
##########
@@ -282,14 +282,28 @@ impl LogicalPlanBuilder {
             ));
         }
 
-        let left_keys: Vec<Column> = left_keys
-            .into_iter()
-            .map(|c| c.into().normalize(&self.plan))
-            .collect::<Result<_>>()?;
-        let right_keys: Vec<Column> = right_keys
-            .into_iter()
-            .map(|c| c.into().normalize(right))
-            .collect::<Result<_>>()?;
+        let (left_keys, right_keys): (Vec<Result<Column>>, Vec<Result<Column>>) =

Review comment:
       changing the signature to be `join_keys: Vec<(impl Into<Column>, impl Into<Column>)>.` instead of `left_keys:...` and `right_keys: ...` makes sense to me




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] seddonm1 commented on a change in pull request #778: JOIN conditions are order dependent

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #778:
URL: https://github.com/apache/arrow-datafusion/pull/778#discussion_r676981952



##########
File path: datafusion/src/logical_plan/builder.rs
##########
@@ -282,14 +282,42 @@ impl LogicalPlanBuilder {
             ));
         }
 
+        let left_keys = left_keys
+            .into_iter()
+            .map(|c| c.into())
+            .collect::<Vec<Column>>();
+
+        let mut swap = false;
         let left_keys: Vec<Column> = left_keys
+            .clone()
             .into_iter()
-            .map(|c| c.into().normalize(&self.plan))
-            .collect::<Result<_>>()?;
+            .map(|c| c.normalize(&self.plan))
+            .collect::<Result<_>>()
+            .or_else(|_| {
+                swap = true;

Review comment:
       I have refactored this to operate at an individual condition level and I think the code is much more succinct.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] seddonm1 commented on pull request #778: JOIN conditions are order dependent

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #778:
URL: https://github.com/apache/arrow-datafusion/pull/778#issuecomment-887855793


   @alamb @houqp I have updated the PR with @alamb suggestion of a `join_keys: (Vec<impl Into<Column>>, Vec<impl Into<Column>>)` signature. I think this does make it less ambiguous than previously but does increase the surface area of this PR - so now needs additional scrutiny.
   
   Regarding @houqp dataframe suggestion, I am not sure how we proceed; I would expect users of the dataframe API to be much more capable of adapting to logically specifying keys in the 'correct' order vs SQL users who just want it to behave like any other SQL engine?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #778: JOIN conditions are order dependent

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #778:
URL: https://github.com/apache/arrow-datafusion/pull/778#issuecomment-888261558


   🎉  thanks again @seddonm1 !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] seddonm1 commented on a change in pull request #778: JOIN conditions are order dependent

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #778:
URL: https://github.com/apache/arrow-datafusion/pull/778#discussion_r677028886



##########
File path: datafusion/src/logical_plan/builder.rs
##########
@@ -282,14 +282,28 @@ impl LogicalPlanBuilder {
             ));
         }
 
-        let left_keys: Vec<Column> = left_keys
-            .into_iter()
-            .map(|c| c.into().normalize(&self.plan))
-            .collect::<Result<_>>()?;
-        let right_keys: Vec<Column> = right_keys
-            .into_iter()
-            .map(|c| c.into().normalize(right))
-            .collect::<Result<_>>()?;
+        let (left_keys, right_keys): (Vec<Result<Column>>, Vec<Result<Column>>) =

Review comment:
       > Welcome back @seddonm1 ! It is great to see a contribution from you. Thank you very much
   
   Thanks @alamb. Unfortunately my employment situation makes it hard to contribute too much but I am still very interested in seeing Datafusion succeed.
   
   The `LogicalPlanBuilder::join` has `left` and `right` keys but they are `left` and `right` relative to a join condition not to a `left` or `right` relation.
   
   The `left_keys` and `right_keys` are where the join conditions are resolved to underlying relations. The Postgres behavior does not require a specific order of conditions for joining so we need to be able to handle both conditions.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org