You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "duongcongtoai (via GitHub)" <gi...@apache.org> on 2023/03/25 15:02:39 UTC

[GitHub] [arrow-datafusion] duongcongtoai opened a new issue, #5738: Bug in HashJoin output_partition cause the input from left input not fully executed

duongcongtoai opened a new issue, #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738

   ### Describe the bug
   
   The HashJoinExec decides output_partition based on this function: https://github.com/apache/arrow-datafusion/blob/b7a33317c2abf265f4ab6b3fe636f87c4d01334c/datafusion/core/src/physical_plan/joins/utils.rs#L90
   
   If PartitionMode is set to Partitioned, join_type is RIGHT, output_partition will depend on output_partition of the right child, this may cause missing execution on left child partitions, if left child has more partitions than right child partition: https://github.com/apache/arrow-datafusion/blob/e87754cfe3afa4c358a8ca9c21c3c4acd020dfe5/datafusion/core/src/physical_plan/joins/hash_join.rs#L413
   
   ### To Reproduce
   
   [Code in this gist](https://gist.github.com/duongcongtoai/4d82074e20c0dfeca8c324bba8ad0e66)
   
   Create 2 ExecutionPlan input from csv with only 1 field "id" and create a HashJoinExec from these inputs. Because during the execution, some parition from the left input is not executed on, they are never probed with associated rows in the right input, so result in a false join:
   ```
   +----+----+
   | id | id |
   +----+----+
   |    | 2  |
   |    | 3  |
   |    | 6  |
   |    | 7  |
   |    | 9  |
   |    | 1  |
   |    | 4  |
   |    | 5  |
   |    | 8  |
   +----+----+
   ```
   
   
   
   ### Expected behavior
   
   HashJoin executes correctly
   ```
   +----+----+
   | id | id |
   +----+----+
   | 1  | 1  |
   | 9  | 9  |
   | 5  | 5  |
   | 8  | 8  |
   | 6  | 6  |
   | 7  | 7  |
   | 4  | 4  |
   | 2  | 2  |
   | 3  | 3  |
   +----+----+
   ```
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on issue #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738#issuecomment-1484740520

   @duongcongtoai 
   Regarding more partitions, do you mean the input partition count is different among left input and right input ?
   Currently in DataFusion, we will insert a `RepartitionExec` for  left input and right input, after the `RepartitionExec`
   the partition count should be always the same. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] duongcongtoai commented on issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed

Posted by "duongcongtoai (via GitHub)" <gi...@apache.org>.
duongcongtoai commented on issue #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738#issuecomment-1489260511

   @mingmwang do we have special reason to validate at execution time instead of plan time? If the constraint is violated, we can avoid starting uncessary execution in partitioned tasks right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] duongcongtoai commented on issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed

Posted by "duongcongtoai (via GitHub)" <gi...@apache.org>.
duongcongtoai commented on issue #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738#issuecomment-1484760318

   Okay, let me open my first PR :D


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed
URL: https://github.com/apache/arrow-datafusion/issues/5738


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on issue #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738#issuecomment-1484655137

   I will take a look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] duongcongtoai commented on issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed

Posted by "duongcongtoai (via GitHub)" <gi...@apache.org>.
duongcongtoai commented on issue #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738#issuecomment-1484748545

   Thank you. Shoud we have a small validation in new function to notify users about this constraint?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on issue #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738#issuecomment-1484753572

   @duongcongtoai Agree.  We should add a validation to enforce this constraint.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on issue #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738#issuecomment-1484779674

   > Okay, let me open my first PR :D
   
   Sure, I think you can add the input partition count check during the real execution time(not plan time) in the `execute()`method.
   And I think both the `HashJoinExec` and `SortMergeJoinExec` will need this check. I'm not sure for `SymmetricHashJoinExec` whether the check is required or not.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org