You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/15 14:44:05 UTC

[GitHub] [arrow-datafusion] boaz-codota opened a new pull request #342: Left join could use bitmap for left join instead of Vec

boaz-codota opened a new pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342


   # Which issue does this PR close?
   Closes #240 .
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   Described in the issue.
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   Described in the issue.
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   No user facing changes.
   <!--
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] boaz-codota commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
boaz-codota commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-841839620


   @alamb @jorgecarleitao I'm trying to replace the usage of bitvec with arrow's structures. Both `BooleanBufferBuilder` & `Bitmap` seems to behave like immutable data structures, which does not support the required use case (specifically: https://github.com/apache/arrow-datafusion/pull/342/files#diff-44d49c7778aa0c300afacdd7d89b0729ffaedd932d1ac34f3ef8db6b6cdfd73aR1084). So I'd love to change my code to use arrow's structures, but those two does not seem to support what is required.
   I looked on other array structures under arrow, but none of them seem to fit as well (https://docs.rs/arrow/4.0.0/arrow/array/index.html).
   Let me know how I should advance from here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] boaz-codota commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
boaz-codota commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-899003871


   Implemented in an arrow native way here: https://github.com/apache/arrow-datafusion/pull/884


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-870484980


   > I'm not sure whether or not I can consume the BooleanBufferBuilder within HashJoinStream.poll_next. Can I?
   
   I am not sure
   
   > If not, I guess I'd add another PR to BooleanBufferBuilder that exposes get_bit
   
   One thing that might be helpful would be to prototype out this PR using a local checkout / fork of the arrow-rs repo so that you can get everything working and find other potential "gotchas" without having to wait on the arrow-rs release cycle
   
   Once you figured out what you needed then we could merge it according in parts


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] boaz-codota commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
boaz-codota commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-868400691


   @alamb tried to refactor the code now, noticed that there is no `get_bit` equivalent of `set_bit` in `BooleanBufferBuilder` and calling `finish` to make receive a `Buffer` is consuming the content of the `BooleanBufferBuilder`. I'm not sure whether or not I can consume the `BooleanBufferBuilder` within `HashJoinStream.poll_next`. Can I? If not, I guess I'd add another PR to `BooleanBufferBuilder` that exposes `get_bit` 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-842621505


   Yes, I think access to the middle only could be done by accessing  / mutating the contents, in which case you are already close to rolling your own bitvec API. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-841780393


   > Thanks a lot @boaz-codota !
   > 
   > Is there any performance difference? imo It would be good to measure before adding a dependency.
   > 
   > Note that arrow also has an `BooleanBufferBuilder` or something that can create a bitmap.
   
   The idea is here is not performance per se, but memory savings. As this is an in-memory join, any saved memory is welcome, as it will increase the chance a given join will fit in memory.
   
   Using a boolean array buffer builder makes sense to me to avoid the dependency.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-841685880


   > @Dandandan I hope I understood the requested change correctly. Was not familiar with bitvec before, but I used the docs and I think I implemented it correctly
   
   Yes, this is awesome, exactly what I meant! Impressive how little is changed.
   
   I will do some benchmarking tonight or tomorrow to see if it's changing anything (but I believe this part is not the most performance sensitive anyway).
   
   @alamb @andygrove @jorgecarleitao 
   What do you think about adding bitvec as dependency? It might be useful later for other datastructures / algorithms too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] boazberman commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
boazberman commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-851049746


   I'm stuck on this, I've implemented the code which I think that should work, but it is not working as expected. I opened the PR to get help.
   https://github.com/apache/arrow-rs/pull/383
   cc @alamb @Dandandan @jorgecarleitao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-870484980


   > I'm not sure whether or not I can consume the BooleanBufferBuilder within HashJoinStream.poll_next. Can I?
   
   I am not sure
   
   > If not, I guess I'd add another PR to BooleanBufferBuilder that exposes get_bit
   
   One thing that might be helpful would be to prototype out this PR using a local checkout / fork of the arrow-rs repo so that you can get everything working and find other potential "gotchas" without having to wait on the arrow-rs release cycle
   
   Once you figured out what you needed then we could merge it according in parts


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-842619606


   Ah, sorry, I see; you need mutable access to an item in the middle of the builder; yeap, that is currently not supported. It assumes that it is always pushed/advanced.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] boaz-codota commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
boaz-codota commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-862112986


   I'll refactor this as soon as I have some free time. Thanks @alamb !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-842632043


   I also am hesitant to add a new dependency to DataFusion without some sort of compelling performance or memory improvement


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-841687494


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/342?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#342](https://codecov.io/gh/apache/arrow-datafusion/pull/342?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (00b4a84) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/1702d6c85ebfdbc968b1dc427a9799e74b64ff96?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1702d6c) will **decrease** coverage by `0.00%`.
   > The diff coverage is `71.42%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/342/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/342?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #342      +/-   ##
   ==========================================
   - Coverage   75.72%   75.71%   -0.01%     
   ==========================================
     Files         143      143              
     Lines       23832    23835       +3     
   ==========================================
   + Hits        18046    18047       +1     
   - Misses       5786     5788       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/342?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/342/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `86.14% <71.42%> (-0.26%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/342?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/342?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [1702d6c...00b4a84](https://codecov.io/gh/apache/arrow-datafusion/pull/342?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] boaz-codota closed pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
boaz-codota closed pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-841762453


   Thanks a lot @boaz-codota !
   
   Is there any performance difference? imo It would be good to measure before adding a dependency.
   
   Note that arrow also has an `BooleanBufferBuilder` or something that can create a bitmap.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-841762453


   Thanks a lot @boaz-codota !
   
   Is there any performance difference? imo It would be good to measure before adding a dependency.
   
   Note that arrow also has an `BooleanBufferBuilder` or something that can create a bitmap.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-860786614


   FYI arrow 4.3.0 has been released with the code in BooleanBufferBuilder


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] boazberman commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
boazberman commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-843828945


   I'll try to tackle adding the set_bit to Arrow's `BooleanBufferBuilder` later this week, hopefully contributing to arrow-rs will be as easy as contributing to this repo


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] boaz-codota commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
boaz-codota commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-841839620


   @alamb @jorgecarleitao I'm trying to replace the usage of bitvec with arrow's structures. Both `BooleanBufferBuilder` & `Bitmap` seems to behave like immutable data structures, which does not support the required use case (specifically: https://github.com/apache/arrow-datafusion/pull/342/files#diff-44d49c7778aa0c300afacdd7d89b0729ffaedd932d1ac34f3ef8db6b6cdfd73aR1084). So I'd love to change my code to use arrow's structures, but those two does not seem to support what is required.
   I looked on other array structures under arrow, but none of them seem to fit as well (https://docs.rs/arrow/4.0.0/arrow/array/index.html).
   Let me know how I should advance from here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-842623668


   To be fair, it is fairly easy to enable that on the `BooleanBufferBuilder`; we need to add a method to set the bit using `bit_utils::set_bit`. `Buffer` for validity is effectively an immutable bitvec, and `BooleanBufferBuilder` is the corresponding mutable version; we just haven't added that functionality.
   
   I find the bitvec a bit overkill for a single use-case of a bitmap, specially when we already have a "bitvec" on arrow, but I am also fine pushing this through and leave this problem for later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] boaz-codota commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
boaz-codota commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-841669557


   @Dandandan I hope I understood the requested change correctly. Was not familiar with bitvec before, but I used the docs and I think I implemented it correctly 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-841780393


   > Thanks a lot @boaz-codota !
   > 
   > Is there any performance difference? imo It would be good to measure before adding a dependency.
   > 
   > Note that arrow also has an `BooleanBufferBuilder` or something that can create a bitmap.
   
   The idea is here is not performance per se, but memory savings. As this is an in-memory join, any saved memory is welcome, as it will increase the chance a given join will fit in memory.
   
   Using a boolean array buffer builder makes sense to me to avoid the dependency.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-842615413


   I agree both the datastructures don't seem very useful for this use case. I don't think there is a mutable structure like bitvec in Arrow...
   WDYT @alamb @jorgecarleitao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on pull request #342: Left join could use bitmap for left join instead of Vec

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #342:
URL: https://github.com/apache/arrow-datafusion/pull/342#issuecomment-842617900


   It is just [buried](https://github.com/apache/arrow-rs/blob/master/arrow/src/array/builder.rs#L287) in the catacombs of the `builder.rs` 💀💀💀 `BooleanBufferBuilder` is its name.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org