You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/05 07:00:44 UTC

[GitHub] [arrow-datafusion] xudong963 opened a new pull request #1402: Transfer right join to left join

xudong963 opened a new pull request #1402:
URL: https://github.com/apache/arrow-datafusion/pull/1402


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   No
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   This ticket transfers `Right Join` to `Left Join`. This will facilitate logic optimization, including predicate push-down, join order exchange, and some future optimization, with one less processing case, and our code logic will also be simplified.
   
   BTW, postgres also did the thing.
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   No
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1402: Transfer right join to left join

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1402:
URL: https://github.com/apache/arrow-datafusion/pull/1402#issuecomment-987873476


   Thanks @xudong963  -- I appreciate your flexibility and understanding


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #1402: Transfer right join to left join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #1402:
URL: https://github.com/apache/arrow-datafusion/pull/1402#issuecomment-987674516


   > > This usage becomes harder when we choose to not respect the SQL syntax and instead encode a right join as a left join even if the user carefully picked the smaller left side in his query?
   > 
   > If I read the source code rightly, we'll optimize the smaller side in a join as the left side during the physical plan phrase (even if the user doesn't do it in SQL)? 
   > 
   > So I think we should reserve `Right Join` in `Join Type`, and then it'll be used during the physical plan phrase. In addition, our dataframe API  users also need the `Right Join` in `Join Type`.
   > 
   > I think this ticket will **only** facilitate logic optimization,  with one less processing case, and our code logic will also be simplified.
   
   That's true - but a user can also disable that specific optimization in the configuration if he wants to make sure the order is preserved. After all, the reordering is/might be based on heuristics instead of complete knowledge about the biggest / smallest side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1402: Transfer right join to left join

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1402:
URL: https://github.com/apache/arrow-datafusion/pull/1402#discussion_r763498003



##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -38,7 +38,7 @@ pub enum JoinType {
     Inner,
     /// Left Join
     Left,
-    /// Right Join
+    /// Right Join: will be transferred to Left Join

Review comment:
       This is a good idea @xudong963  -- I do think it will keep things somewhat easier to handle.
   
   If we are going to make this change, I think we should remove the `Right` enum value from `JoinType` 
   
   The only possibly issue I can see with this change is that is may complicate the join implementation (which needs to distinguish between both which relation needs to be preserved and which gets hashed)
   
   Using  `Left` and `Right` enum values might allow us to maintain that distinction 
   
   So for example
   
   ```
   A LEFT JOIN B    --> Means preserve B, hash B
   B LEFT JOIN A    --> Means preserve A, hash A
   B RIGHT JOIN A --> Means preserve B, hash A
   A RIGHT JOIN B --> Means preserve A, hash B
   ```
   
   
   Thoughts @Dandandan ?

##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -38,7 +38,7 @@ pub enum JoinType {
     Inner,
     /// Left Join
     Left,
-    /// Right Join
+    /// Right Join: will be transferred to Left Join

Review comment:
       This is a good idea @xudong963  -- I do think it will keep things somewhat easier to handle.
   
   If we are going to make this change, I think we should remove the `Right` enum value from `JoinType` 
   
   The only possibly issue I can see with this change is that is may complicate the join implementation (which needs to distinguish between both which relation needs to be preserved and which gets hashed)
   
   Using  `Left` and `Right` enum values might allow us to maintain that distinction 
   
   So for example
   
   ```
   A LEFT JOIN B    --> Means preserve B, hash B
   B LEFT JOIN A    --> Means preserve A, hash A
   B RIGHT JOIN A   --> Means preserve B, hash A
   A RIGHT JOIN B   --> Means preserve A, hash B
   ```
   
   
   Thoughts @Dandandan ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #1402: Transfer right join to left join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #1402:
URL: https://github.com/apache/arrow-datafusion/pull/1402#issuecomment-987615421


   I am not sure we would want to do this?
   
   As a user that might want more control over the execution and join order (to save time, memory in a specific query) currently can help DataFusion a bit by using the smaller side in a join as the left side, or even disable the optimization rule to have full control.
   
   This usage becomes harder when we choose to not respect the SQL syntax and instead encode a right join as a left join even if the user carefully picked the smaller left side in his query?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on pull request #1402: Transfer right join to left join

Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1402:
URL: https://github.com/apache/arrow-datafusion/pull/1402#issuecomment-987868151


   > I agree with @Dandandan that keeping the `Left` and `Right` enum and not trying to change from `Right` to `Left` implicitly is probably the best thing to do for now as it allows users to explicitly control what the join order is.
   
   Ok, I'll close the ticket.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 closed pull request #1402: Transfer right join to left join

Posted by GitBox <gi...@apache.org>.
xudong963 closed pull request #1402:
URL: https://github.com/apache/arrow-datafusion/pull/1402


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1402: Transfer right join to left join

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1402:
URL: https://github.com/apache/arrow-datafusion/pull/1402#issuecomment-987865048


   I agree with @Dandandan  that keeping the `Left` and `Right` enum and not trying to change from `Right` to `Left` implicitly is probably the best thing to do for now as it allows users to explicitly control what the join order is. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #1402: Transfer right join to left join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #1402:
URL: https://github.com/apache/arrow-datafusion/pull/1402#discussion_r763674180



##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -38,7 +38,7 @@ pub enum JoinType {
     Inner,
     /// Left Join
     Left,
-    /// Right Join
+    /// Right Join: will be transferred to Left Join

Review comment:
       Won't removing the right join effectively mean we can't reorder a left join to a right join?
   
   When a left join has a bigger side on the left it means we would like to swap the sides and the join implementation.
   
   Besides hashing there is also the producing rows that didn't match. For a left join this is done at the end, while for the right join this can be done for each `RecordBatch`.
   
   It might be a bit strange to encode a different left/right enum, because for inner, outer, etc. that doesn't make a lot of sense?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on pull request #1402: Transfer right join to left join

Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1402:
URL: https://github.com/apache/arrow-datafusion/pull/1402#issuecomment-987653202


   > This usage becomes harder when we choose to not respect the SQL syntax and instead encode a right join as a left join even if the user carefully picked the smaller left side in his query?
   
   If I read the source code rightly, we'll optimize the smaller side in a join as the left side during the physical plan phrase (even if the user doesn't do it in SQL)? 
   
   So I think we should reserve `Right Join` in `Join Type`, and then it'll be used during the physical plan phrase. In addition, our dataframe API  users also need the `Right Join` in `Join Type`.
   
   I think this ticket will **only** facilitate logic optimization,  with one less processing case, and our code logic will also be simplified.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1402: Transfer right join to left join

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1402:
URL: https://github.com/apache/arrow-datafusion/pull/1402#discussion_r763928259



##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -38,7 +38,7 @@ pub enum JoinType {
     Inner,
     /// Left Join
     Left,
-    /// Right Join
+    /// Right Join: will be transferred to Left Join

Review comment:
       Yes I think you are right that if we removed the `Right` variant then we would have to add some other way to encode the "which to hash"
   
   I find the standard terminology confusing because there is an (implicit) optimization for one of the two join inputs in most implementations (e.g. hash join, which one to hash) that is often encoded as "which appears as the left argument in the tree" which has a different meaning than a `LEFT JOIN` 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #1402: Transfer right join to left join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #1402:
URL: https://github.com/apache/arrow-datafusion/pull/1402#discussion_r763674180



##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -38,7 +38,7 @@ pub enum JoinType {
     Inner,
     /// Left Join
     Left,
-    /// Right Join
+    /// Right Join: will be transferred to Left Join

Review comment:
       Won't removing the right join effectively mean we can't reorder a left join to a right join?
   
   When a left join has a bigger side on the left it means we would like to swap the sides and the join implementation.
   
   Besides hashing there is also the producing rows that didn't match. For a left join this is done at the end, while for the right join this can be done for each `RecordBatch`.
   
   It might be a bit strange to encode a different left/right enum, because for inner, full outer, etc. that doesn't make a lot of sense?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org