You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/14 06:57:04 UTC

[GitHub] [arrow-datafusion] Ted-Jiang opened a new issue #1823: implement bitmap_distinct function using bitmap

Ted-Jiang opened a new issue #1823:
URL: https://github.com/apache/arrow-datafusion/issues/1823


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Like #1115 implement `u8, i8, u16, i16, u32, i32` by using bitmap called `bitmap_distinct`
   
   **Describe the solution you'd like**
   I will use `arrow-bitmap`  and [roaring-bitmap](https://github.com/RoaringBitmap/roaring-rs)  and check the performance of both
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Ted-Jiang commented on issue #1823: implement bitmap_distinct function using bitmap

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1823:
URL: https://github.com/apache/arrow-datafusion/issues/1823#issuecomment-1038768450


   i have implement a initial version get below result:
   1million_rows_10thousand_distinct.parquet
   
   ```
   1. count distint
   +----------------------------+
   | COUNT(DISTINCT test.value) |
   +----------------------------+
   | 10000                      |
   +----------------------------+
   1 row in set. Query took 0.237 seconds.
   
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   2. bitmap distinct (roaring-rs)
   
   +---------------------------------+
   | BITMAPCOUNTDISTINCT(test.value) |
   +---------------------------------+
   | 10000                           |
   +---------------------------------+
   1 row in set. Query took 0.052 seconds
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   3. approx distinct (hll)
   
   +----------------------------+
   | APPROXDISTINCT(test.value) |
   +----------------------------+
   | 9943                       |
   +----------------------------+
   1 row in set. Query took 0.047 seconds.
   
   ```
   
   the bitmap used is [this](https://github.com/RoaringBitmap/roaring-rs) 
   i have checked influx_iox use [croating-rs](https://github.com/saulius/croaring-rs)
   @alamb Sorry to bother you 😂, could you share some info why use croating-rs, if you have a bench result that would be fantastic 👍  ! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] liukun4515 commented on issue #1823: implement bitmap_distinct function using bitmap

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on issue #1823:
URL: https://github.com/apache/arrow-datafusion/issues/1823#issuecomment-1038800123


   > i have implement a initial version get below result: 1million_rows_10thousand_distinct.parquet
   > 
   > ```
   > 1. count distint
   > +----------------------------+
   > | COUNT(DISTINCT test.value) |
   > +----------------------------+
   > | 10000                      |
   > +----------------------------+
   > 1 row in set. Query took 0.237 seconds.
   > 
   > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   > 2. bitmap distinct (roaring-rs)
   > 
   > +---------------------------------+
   > | BITMAPCOUNTDISTINCT(test.value) |
   > +---------------------------------+
   > | 10000                           |
   > +---------------------------------+
   > 1 row in set. Query took 0.052 seconds
   > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   > 3. approx distinct (hll)
   > 
   > +----------------------------+
   > | APPROXDISTINCT(test.value) |
   > +----------------------------+
   > | 9943                       |
   > +----------------------------+
   > 1 row in set. Query took 0.047 seconds.
   > ```
   > 
   > the bitmap used is [this](https://github.com/RoaringBitmap/roaring-rs) i have checked influx_iox use [croating-rs](https://github.com/saulius/croaring-rs) @alamb Sorry to bother you 😂, could you share some info why use croating-rs, if you have a bench result that would be fantastic 👍 !
   
   Could you please file the draft of the pull request?
   @Ted-Jiang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #1823: implement bitmap_distinct function using bitmap

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1823:
URL: https://github.com/apache/arrow-datafusion/issues/1823#issuecomment-1040800234


    > @alamb Sorry to bother you 😂, could you share some info why use croating-rs, if you have a bench result that would be fantastic 👍 !
   
   IOx is using croaring for its "ReadBuffer" (an optimized in memory format we use rather than straight up RecordBatches). 
   @e-dard  is the expert. There is more information from the Feb 2 2021 tech talk: https://github.com/influxdata/influxdb_iox/tree/main/docs#iox-tech-talks
   
   Specifically:
   Recording:  https://www.youtube.com/watch?v=KslD31VNqPU
   Slides: https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-intro-to-the-influxdb-iox-read-buffer-a-readoptimized-inmemory-query-execution-engine
   
   Sorry for the delayed response


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Ted-Jiang edited a comment on issue #1823: implement bitmap_distinct function using bitmap

Posted by GitBox <gi...@apache.org>.
Ted-Jiang edited a comment on issue #1823:
URL: https://github.com/apache/arrow-datafusion/issues/1823#issuecomment-1042597079


   Hi @e-dard Thanks a lot for your info! What an admirable work in IOx.
   I have done some test above,there is 2x boost for `croaring-rs` than `roaring-rs`.
   and according to `reliability ` you mentioned, they are strong evidences for using it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Ted-Jiang commented on issue #1823: implement bitmap_distinct function using bitmap

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1823:
URL: https://github.com/apache/arrow-datafusion/issues/1823#issuecomment-1042597079


   @e-dard Thanks a lot for your info! What an admirable work in IOX.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Ted-Jiang commented on issue #1823: implement bitmap_distinct function using bitmap

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1823:
URL: https://github.com/apache/arrow-datafusion/issues/1823#issuecomment-1039896766


   Test result of use `roaring-rs` and `croaring-rs` (use same logic: insert one value one time)
   
   1million_rows_10thousand_distinct.parquet
   ```
   bitmap distinct 
   
   (roaring-rs)
   +---------------------------------+
   | BITMAPCOUNTDISTINCT(test.value) |
   +---------------------------------+
   | 10000                           |
   +---------------------------------+
   1 row in set. Query took 0.052 seconds
   
   (croaring-rs)
   +---------------------------------+
   | BITMAPCOUNTDISTINCT(test.value) |
   +---------------------------------+
   | 10000                           |
   +---------------------------------+
   1 row in set. Query took 0.038 seconds.
   
   ```
   
   1million_1million.parquet
   ```
   roaring-rs
   +---------------------------------+
   | BITMAPCOUNTDISTINCT(test.value) |
   +---------------------------------+
   | 631504                          |
   +---------------------------------+
   1 row in set. Query took 0.175 seconds(roaring-rs).
   
   croaring-rs
   +---------------------------------+
   | BITMAPCOUNTDISTINCT(test.value) |
   +---------------------------------+
   | 631504                          |
   +---------------------------------+
   1 row in set. Query took 0.052 seconds (croaring-rs).
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] e-dard commented on issue #1823: implement bitmap_distinct function using bitmap

Posted by GitBox <gi...@apache.org>.
e-dard commented on issue #1823:
URL: https://github.com/apache/arrow-datafusion/issues/1823#issuecomment-1041369032


   Hey @Ted-Jiang!
   
   Nice to see some of these ideas making there way into Datafusion! I developed some of these ideas for IOx's Read Buffer happened in 2020.
   
   At the time I chose `croaring-rs` for a couple of reasons:
   
   - performance: I did some benchmarking and it was faster than the pure rust crate (sadly I can't find these benchmarks on my machine now).
   - reliability: `croaring-rs` wraps the officially maintained C/C++ version, which generally means it's a lower risk choice.
   
   The TLDR of how I use bitmaps in the Read Buffer is as follows:
   
    - constant time row identification for predicates that match `column op literal` (which is the vast majority for InfluxData's use-cases). When a user specifies one of these we already have a compressed bitmap of all matching rows available.
    - (very) late materialisation. After all predicates are applied to all columns in memory (generally only working on the compressed representations) then the bitsets are combined appropriately (intersected/unioned etc). Only then does the Read Buffer begin materialising rows into output record batches based on the ordinal offsets in the final bitmap.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Ted-Jiang commented on issue #1823: implement bitmap_distinct function using bitmap

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1823:
URL: https://github.com/apache/arrow-datafusion/issues/1823#issuecomment-1039896766


   Test result of use `roaring-rs` and `croaring-rs` (use same logic: insert one value one time)
   
   1million_rows_10thousand_distinct.parquet
   ```
   bitmap distinct 
   
   (roaring-rs)
   +---------------------------------+
   | BITMAPCOUNTDISTINCT(test.value) |
   +---------------------------------+
   | 10000                           |
   +---------------------------------+
   1 row in set. Query took 0.052 seconds
   
   (croaring-rs)
   +---------------------------------+
   | BITMAPCOUNTDISTINCT(test.value) |
   +---------------------------------+
   | 10000                           |
   +---------------------------------+
   1 row in set. Query took 0.038 seconds.
   
   ```
   
   1million_1million.parquet
   ```
   roaring-rs
   +---------------------------------+
   | BITMAPCOUNTDISTINCT(test.value) |
   +---------------------------------+
   | 631504                          |
   +---------------------------------+
   1 row in set. Query took 0.175 seconds(roaring-rs).
   
   croaring-rs
   +---------------------------------+
   | BITMAPCOUNTDISTINCT(test.value) |
   +---------------------------------+
   | 631504                          |
   +---------------------------------+
   1 row in set. Query took 0.052 seconds (croaring-rs).
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org