Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/27 21:37:25 UTC

[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #1652: ARROW2: Performance benchmark

jorgecarleitao commented on issue #1652:
URL: https://github.com/apache/arrow-datafusion/issues/1652#issuecomment-1023661427


   * arrow-rs: group filter push down
   * arrow2: group filter push down, page filter push down
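   Both kinds of pushdown rest on the same idea: parquet stores min/max statistics per row group (and, with page indexes, per page), so a predicate can prove a whole unit irrelevant without decoding it. A minimal sketch of that pruning logic, with illustrative names that are not arrow-rs/arrow2 APIs:

```rust
/// Min/max statistics as recorded in parquet metadata for a row group or page.
/// (Hypothetical struct for illustration; real readers parse this from the
/// Thrift footer / page index.)
struct Stats {
    min: i64,
    max: i64,
}

/// For a predicate like `col > value`, a unit can be skipped when its
/// statistics prove no row can match.
fn can_skip_gt(stats: &Stats, value: i64) -> bool {
    stats.max <= value
}

fn main() {
    // Three row groups with their column statistics.
    let groups = vec![
        Stats { min: 0, max: 50 },
        Stats { min: 40, max: 120 },
        Stats { min: 130, max: 200 },
    ];
    // For `col > 100`, only groups 1 and 2 need to be decoded.
    let to_read: Vec<usize> = groups
        .iter()
        .enumerate()
        .filter(|(_, s)| !can_skip_gt(s, 100))
        .map(|(i, _)| i)
        .collect();
    println!("{:?}", to_read); // [1, 2]
}
```

   Page-level pushdown is the same check applied at a finer granularity, which is why it can skip I/O even inside row groups that survive the coarser filter.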
   
   AFAIK both support reading and writing dictionary-encoded arrays to dictionary-encoded column chunks (but neither supports pushdown based on dictionary values at the moment).
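   For context, dictionary encoding replaces repeated values with small integer keys into a table of unique values; round-tripping means the in-memory dictionary array maps directly onto parquet's dictionary-encoded chunk. A std-only sketch of the encoding step (illustrative, not either crate's API):

```rust
use std::collections::HashMap;

/// Dictionary-encode a slice of strings: returns the dictionary of unique
/// values plus one key per input value. (Hypothetical helper for illustration.)
fn dictionary_encode(values: &[&str]) -> (Vec<String>, Vec<usize>) {
    let mut dict: Vec<String> = Vec::new();
    let mut seen: HashMap<String, usize> = HashMap::new();
    let mut keys = Vec::with_capacity(values.len());
    for v in values {
        // First sighting of a value appends it to the dictionary;
        // every sighting emits its index as the key.
        let idx = *seen.entry((*v).to_string()).or_insert_with(|| {
            dict.push((*v).to_string());
            dict.len() - 1
        });
        keys.push(idx);
    }
    (dict, keys)
}

fn main() {
    let (dict, keys) = dictionary_encode(&["a", "b", "a", "c", "b"]);
    println!("{:?} {:?}", dict, keys); // ["a", "b", "c"] [0, 1, 0, 2, 1]
}
```

   Pushdown on dictionary values would mean evaluating the predicate once against `dict` instead of once per key, which is the optimization neither crate implements yet.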
   
   TBH, IMO the biggest limiting factor in implementing anything in parquet is its lack of documentation - it is simply a lot of work to decipher what is being requested, and the situation is not improving. For example, I spent a lot of time understanding the encodings, have [a PR](https://github.com/apache/parquet-format/pull/170) to try to help future folks implementing them, and it has been lingering for ~9 months now. I wish parquet were a bit more inspired by e.g. Arrow or ORC in this respect.
   
   Note that datafusion's benchmarks only use "required" / non-nullable data, so most optimizations on null values are not visible from datafusion's perspective. Last time I benchmarked, [arrow2/parquet2 was much faster](https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=1919295045) on nullable data; I am a bit surprised to see so many differences on non-nullable data.
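   To illustrate why "required" data hides null-handling costs: a kernel over a nullable column must consult a validity mask per value, while a required column admits a straight-line loop. A minimal sketch (plain Rust, with a `bool` slice standing in for Arrow's validity bitmap):

```rust
/// Sum over a required (non-nullable) column: no validity checks needed.
fn sum_required(values: &[i64]) -> i64 {
    values.iter().sum()
}

/// Sum over a nullable column: each value is gated by its validity flag,
/// so this is the path where null-handling optimizations pay off.
fn sum_nullable(values: &[i64], validity: &[bool]) -> i64 {
    values
        .iter()
        .zip(validity)
        .filter(|(_, &valid)| valid)
        .map(|(v, _)| v)
        .sum()
}

fn main() {
    let values = [1, 2, 3, 4];
    let validity = [true, false, true, true]; // second value is null
    println!("{}", sum_required(&values)); // 10
    println!("{}", sum_nullable(&values, &validity)); // 8
}
```

   A benchmark that only ever exercises `sum_required`-style paths cannot distinguish implementations that differ mainly in the nullable path.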

