You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/01 15:45:28 UTC

[GitHub] [arrow-datafusion] andygrove opened a new issue #235: Failing tests in master: left_join_using and left_join

andygrove opened a new issue #235:
URL: https://github.com/apache/arrow-datafusion/issues/235


   **Describe the bug**
   ```
   ---- left_join_using stdout ----
   thread 'left_join_using' panicked at 'assertion failed: `(left == right)`
     left: `[["11", "a", "z"], ["22", "b", "y"], ["33", "c", "NULL"], ["44", "d", "x"]]`,
    right: `[["11", "a", "z"], ["22", "b", "y"], ["44", "d", "x"]]`', datafusion/tests/sql.rs:1252:5
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   
   ---- left_join stdout ----
   thread 'left_join' panicked at 'assertion failed: `(left == right)`
     left: `[["11", "a", "z"], ["22", "b", "y"], ["33", "c", "NULL"], ["44", "d", "x"]]`,
    right: `[["11", "a", "z"], ["22", "b", "y"], ["44", "d", "x"]]`', datafusion/tests/sql.rs:1221:5
   ```
   
   **To Reproduce**
   Checkout commit `c945b03f3a459a5c15f481f9d52819df56e1090c` and run `cargo test`.
   
   **Expected behavior**
   Tests should pass.
   
   **Additional context**
   Running on my 24 core desktop so perhaps this uncovers a race condition?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #235: Failing tests in master: left_join_using and left_join

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #235:
URL: https://github.com/apache/arrow-datafusion/issues/235#issuecomment-830784725


   I was thinking (this may be completely off): 
   
   Since nulls are generally ignored on joins, can't we fix the null stuff on joins by a "drop_null" operation on both sides, prior to any hashing and actual join?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #235: Failing tests in master: left_join_using and left_join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #235:
URL: https://github.com/apache/arrow-datafusion/issues/235#issuecomment-830792427


   Thanks @jorgecarleitao
   
   I added an implementation of left join where unmatched left rows are produced at the end of a stream.
   I think there might be some possible improvements:
   
   * Use a bitmap structure instead of `Vec<bool>`. Efficiency-wise, the current PR should already be a large improvement though (don't have any benchmarks to prove it ATM, but a new hashset for each batch seems like it will be quite slow).
   * Generate the unmatched rows in batches with the configured batch size. Currently, it generates them in "one go".
   
   @andygrove this also seems to fix the tests in this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #235: Failing tests in master: left_join_using and left_join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #235:
URL: https://github.com/apache/arrow-datafusion/issues/235#issuecomment-830783405


   I reproduced the bug by explicitly setting the concurrency for those tests to `24`.
   
   So here is my hypothesis into what happens:
   
   * The left join has a wrong implementation in that it will produce rows when they are missing in the right batch, instead of in the entire partition.
   * The referenced commit has some changes to (re)hashing of single columns, which means that columns could end up without any right-side rows.
   * We also use the same hashing code in hash-repartition which means that the `33` row could end up in its "own" partition. In that case, no right batch is being processed, so no row is being generated for `33`.
   
   I have a feeling that to fix this in the general case it would be best to "just" fix the left join implementation.
   
   Another option would be maybe to cherry-pick this change which would fix just this test from PR #55:
   
   https://github.com/apache/arrow-datafusion/pull/55/files#diff-44d49c7778aa0c300afacdd7d89b0729ffaedd932d1ac34f3ef8db6b6cdfd73aR904


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan edited a comment on issue #235: Failing tests in master: left_join_using and left_join

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on issue #235:
URL: https://github.com/apache/arrow-datafusion/issues/235#issuecomment-830792427


   Thanks @jorgecarleitao
   
   I added an implementation of left join where unmatched left rows are produced at the end of a stream.
   I'm not totally sure what you mean, I think we still have to keep track of rows that didn't match any row at the left side.
   For inner joins or on the the right/left part of a join for respectively left/right joins, we could indeed add a null filter on the columns, but this would be more of an optimization to push down null filters as far as possible. I think this is something Spark does too.
   
   I think there might be some possible improvements in the current implementation:
   
   * Use a bitmap structure instead of `Vec<bool>`. Efficiency-wise, the current PR should already be a large improvement though (don't have any benchmarks to prove it ATM, but a new hashset for each batch seems like it will be quite slow).
   * Generate the unmatched rows in batches with the configured batch size. Currently, it generates them in "one go".
   
   @andygrove this also seems to fix the tests in this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan edited a comment on issue #235: Failing tests in master: left_join_using and left_join

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on issue #235:
URL: https://github.com/apache/arrow-datafusion/issues/235#issuecomment-830792427


   Thanks @jorgecarleitao
   
   I added an implementation of left join where unmatched left rows are produced at the end of a stream.
   I'm not totally sure what you mean, I think we still have to keep track of rows that didn't match any row at the left side.
   For inner joins or on the the right/left part of a join for respectively left/right joins, we could indeed add a null filter on the columns, but this would be more of an optimization to push down null filters as far as possible. I think this is something Spark does too.
   
   I think there might be some possible improvements in the current implementation:
   
   * Use a bitmap structure instead of `Vec<bool>`. Efficiency-wise, the current PR should already be a large improvement though (don't have any benchmarks to prove it ATM, but a new hashset for each batch seems like it will be quite slow).
   * Generate the unmatched rows in batches with the configured batch size. Currently, it generates them in "one go".
   
   @andygrove this also seems to fix the tests in this issue, would be nice if you could confirm this on the
   
   More generally, maybe we should run the sql tests in some different settings (concurrency / optimizations, etc) to do some more exhaustive checking using all of the different configurations / environmental changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #235: Failing tests in master: left_join_using and left_join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #235:
URL: https://github.com/apache/arrow-datafusion/issues/235#issuecomment-830664200


   Hm. I will have a look tomorrow if someone doesn't beat me to it. There is also a known issue with the left join being wrong across multiple batches, so it could that the fix included in this particular commit triggers the issue in those tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan edited a comment on issue #235: Failing tests in master: left_join_using and left_join

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on issue #235:
URL: https://github.com/apache/arrow-datafusion/issues/235#issuecomment-830792427


   Thanks @jorgecarleitao
   
   I added an implementation of left join where unmatched left rows are produced at the end of a stream.
   I'm not totally sure what you mean, I think we still have to keep track of rows that didn't match any row at the left side.
   For inner joins or on the the right/left part of a join for respectively left/right joins, we could indeed add a null filter on the columns, but this would be more of an optimization to push down null filters as far as possible. I think this is something Spark does too.
   
   I think there might be some possible improvements:
   
   * Use a bitmap structure instead of `Vec<bool>`. Efficiency-wise, the current PR should already be a large improvement though (don't have any benchmarks to prove it ATM, but a new hashset for each batch seems like it will be quite slow).
   * Generate the unmatched rows in batches with the configured batch size. Currently, it generates them in "one go".
   
   @andygrove this also seems to fix the tests in this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan closed issue #235: Failing tests in master: left_join_using and left_join

Posted by GitBox <gi...@apache.org>.
Dandandan closed issue #235:
URL: https://github.com/apache/arrow-datafusion/issues/235


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andygrove commented on issue #235: Failing tests in master: left_join_using and left_join

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #235:
URL: https://github.com/apache/arrow-datafusion/issues/235#issuecomment-830655860


   The regression seems to have been introduced in this commit (or at least I start to see it with this commit).
   
   ```
   commit 33715747db5a3cc48936c7b26859d4ad5809cde8 (HEAD)
   Author: Daniƫl Heres <da...@gmail.com>
   Date:   Wed Apr 28 12:28:16 2021 +0200
   
       [DataFusion] Optimize hash join inner workings, null handling fix (#24)
   ```
   
   @Dandandan fyi


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org