You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "duongcongtoai (via GitHub)" <gi...@apache.org> on 2023/03/25 15:02:39 UTC
[GitHub] [arrow-datafusion] duongcongtoai opened a new issue, #5738: Bug in HashJoin output_partition cause the input from left input not fully executed
duongcongtoai opened a new issue, #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738
### Describe the bug
The HashJoinExec decides output_partition based on this function: https://github.com/apache/arrow-datafusion/blob/b7a33317c2abf265f4ab6b3fe636f87c4d01334c/datafusion/core/src/physical_plan/joins/utils.rs#L90
If PartitionMode is set to Partitioned, join_type is RIGHT, output_partition will depend on output_partition of the right child, this may cause missing execution on left child partitions, if left child has more partitions than right child partition: https://github.com/apache/arrow-datafusion/blob/e87754cfe3afa4c358a8ca9c21c3c4acd020dfe5/datafusion/core/src/physical_plan/joins/hash_join.rs#L413
### To Reproduce
[Code in this gist](https://gist.github.com/duongcongtoai/4d82074e20c0dfeca8c324bba8ad0e66)
Create 2 ExecutionPlan input from csv with only 1 field "id" and create a HashJoinExec from these inputs. Because during the execution, some parition from the left input is not executed on, they are never probed with associated rows in the right input, so result in a false join:
```
+----+----+
| id | id |
+----+----+
| | 2 |
| | 3 |
| | 6 |
| | 7 |
| | 9 |
| | 1 |
| | 4 |
| | 5 |
| | 8 |
+----+----+
```
### Expected behavior
HashJoin executes correctly
```
+----+----+
| id | id |
+----+----+
| 1 | 1 |
| 9 | 9 |
| 5 | 5 |
| 8 | 8 |
| 6 | 6 |
| 7 | 7 |
| 4 | 4 |
| 2 | 2 |
| 3 | 3 |
+----+----+
```
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] mingmwang commented on issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed
Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on issue #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738#issuecomment-1484740520
@duongcongtoai
Regarding more partitions, do you mean the input partition count is different among left input and right input ?
Currently in DataFusion, we will insert a `RepartitionExec` for left input and right input, after the `RepartitionExec`
the partition count should be always the same.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] duongcongtoai commented on issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed
Posted by "duongcongtoai (via GitHub)" <gi...@apache.org>.
duongcongtoai commented on issue #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738#issuecomment-1489260511
@mingmwang do we have special reason to validate at execution time instead of plan time? If the constraint is violated, we can avoid starting uncessary execution in partitioned tasks right?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] duongcongtoai commented on issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed
Posted by "duongcongtoai (via GitHub)" <gi...@apache.org>.
duongcongtoai commented on issue #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738#issuecomment-1484760318
Okay, let me open my first PR :D
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb closed issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed
URL: https://github.com/apache/arrow-datafusion/issues/5738
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] mingmwang commented on issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed
Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on issue #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738#issuecomment-1484655137
I will take a look.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] duongcongtoai commented on issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed
Posted by "duongcongtoai (via GitHub)" <gi...@apache.org>.
duongcongtoai commented on issue #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738#issuecomment-1484748545
Thank you. Shoud we have a small validation in new function to notify users about this constraint?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] mingmwang commented on issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed
Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on issue #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738#issuecomment-1484753572
@duongcongtoai Agree. We should add a validation to enforce this constraint.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] mingmwang commented on issue #5738: Bug in HashJoin output_partition cause the input from left input not fully executed
Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on issue #5738:
URL: https://github.com/apache/arrow-datafusion/issues/5738#issuecomment-1484779674
> Okay, let me open my first PR :D
Sure, I think you can add the input partition count check during the real execution time(not plan time) in the `execute()`method.
And I think both the `HashJoinExec` and `SortMergeJoinExec` will need this check. I'm not sure for `SymmetricHashJoinExec` whether the check is required or not.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org