
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/02 11:18:30 UTC

[GitHub] [arrow-datafusion] Dandandan edited a comment on issue #235: Failing tests in master: left_join_using and left_join

Dandandan edited a comment on issue #235:
URL: https://github.com/apache/arrow-datafusion/issues/235#issuecomment-830792427


   Thanks @jorgecarleitao
   
   I added an implementation of left join where unmatched left rows are produced at the end of a stream.
   I'm not totally sure what you mean. I think we still have to keep track of rows that didn't match any row at the left side.
   For inner joins, or on the right/left side of a left/right join respectively, we could indeed add a null filter on the columns, but this would be more of an optimization to push down null filters as far as possible. I think this is something Spark does too.
   
   I think there might be some possible improvements:
   
   * Use a bitmap structure instead of `Vec<bool>`. Efficiency-wise, the current PR should already be a large improvement though (don't have any benchmarks to prove it ATM, but a new hashset for each batch seems like it will be quite slow).
   * Generate the unmatched rows in batches with the configured batch size. Currently, it generates them in "one go".
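
To make the two points above concrete, here is a rough standalone sketch (not the code in the PR). It assumes plain `i32` keys instead of Arrow arrays, and the `Bitmap` type and `left_join` function are hypothetical names made up for illustration. It marks matched left rows in a packed bitmap while the right side streams through, then emits the unmatched left rows at the end of the stream in chunks of `batch_size`:

```rust
use std::collections::HashMap;

/// Hypothetical packed bitmap: one bit per left-side row,
/// as an alternative to `Vec<bool>` (one byte per row).
struct Bitmap {
    words: Vec<u64>,
}

impl Bitmap {
    fn new(len: usize) -> Self {
        Bitmap { words: vec![0; (len + 63) / 64] }
    }

    fn set(&mut self, i: usize) {
        self.words[i / 64] |= 1u64 << (i % 64);
    }

    fn get(&self, i: usize) -> bool {
        self.words[i / 64] & (1u64 << (i % 64)) != 0
    }
}

/// Toy left join on `i32` keys (an assumption; the real operator works on
/// Arrow record batches). Returns output batches of (left key, right key).
fn left_join(
    left: &[i32],
    right_batches: &[Vec<i32>],
    batch_size: usize,
) -> Vec<Vec<(i32, Option<i32>)>> {
    assert!(batch_size > 0, "batch_size must be positive");

    // Build side: map each key to the left row indices that carry it.
    let mut index: HashMap<i32, Vec<usize>> = HashMap::new();
    for (i, key) in left.iter().enumerate() {
        index.entry(*key).or_default().push(i);
    }

    // One bitmap for the whole stream, instead of a new set per batch.
    let mut visited = Bitmap::new(left.len());
    let mut out = Vec::new();

    // Probe side: emit matches per right batch and mark matched left rows.
    for batch in right_batches {
        let mut rows = Vec::new();
        for right_key in batch {
            if let Some(left_rows) = index.get(right_key) {
                for &i in left_rows {
                    visited.set(i);
                    rows.push((left[i], Some(*right_key)));
                }
            }
        }
        if !rows.is_empty() {
            out.push(rows);
        }
    }

    // End of stream: collect left rows that never matched ...
    let unmatched: Vec<(i32, Option<i32>)> = (0..left.len())
        .filter(|&i| !visited.get(i))
        .map(|i| (left[i], None))
        .collect();

    // ... and emit them in chunks of `batch_size` rather than in "one go".
    for chunk in unmatched.chunks(batch_size) {
        out.push(chunk.to_vec());
    }
    out
}
```

For example, `left_join(&[1, 2, 3, 4], &[vec![2, 5], vec![2, 4]], 1)` yields the two matched batches first, followed by the unmatched left keys `1` and `3` as separate one-row batches. This sketch still materializes all unmatched rows before chunking; a real implementation would probably want to build each output batch lazily.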
   
   @andygrove this also seems to fix the tests in this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org