Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/07 23:27:01 UTC
[GitHub] [arrow-rs] yordan-pavlov edited a comment on issue #171: Implement returning dictionary arrays from parquet reader

yordan-pavlov edited a comment on issue #171:
URL: https://github.com/apache/arrow-rs/issues/171#issuecomment-988336528


   @tustvold thank you for looking into this, and for the excellent summary of the parquet reader stack - this should probably go in documentation somewhere as it takes a while to figure out.
   
   The main reason work on the `ArrowArrayReader` stalled is a big change in my personal life: I had a baby, and as much as I would like to spend more time on this project, I have much less free time now. I hope that in a few months I will have more free time and will be able to contribute again. The other reason is that although I was able to make the new `ArrowArrayReader` several times faster for string arrays, which appears to bring some nice improvements in total execution time (the old `ArrayReader` is slow for string arrays), I was struggling to make it faster in all cases for primitive arrays. I had some ideas (e.g. making the column chunk context a self-referential struct so that a dictionary could be built more efficiently from the page buffer, avoiding unnecessary memory copies), but the baby came before I could finish that.
   
   Here are my thoughts on preserving dictionary arrays:
   * the performance gained from preserving dictionary arrays depends very much on upstream processing (e.g. can filter methods be implemented that benefit from a dictionary array, for instance by making better use of SIMD, and how much of the larger query can be processed before unpacking the dictionary). I tried some synthetic performance tests to measure the impact of unpacking the dictionary at different stages of query processing (including filter operators that can make use of the dictionary), but, as far as I can remember, couldn't see the performance improvements I was expecting; maybe my setup was flawed, and results might be different with actual code
   * I wonder if either (or both) of the two proposed new config values, `delimit_row_groups` and `preserve_dictionaries`, can be enabled / disabled automatically (e.g. based on the query, data source, etc.) so that most of the time they don't need to be changed; my thinking is that the default configuration / implementation should work best in most cases, and settings should only have to be changed very rarely, under very specific circumstances, and by someone who knows very well what they are doing
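
To make the "filter operators that can make use of the dictionary" idea concrete, here is a minimal std-only sketch. It does not use the arrow-rs API: `DictColumn`, `filter_mask` and `filter_mask_unpacked` are hypothetical names invented for illustration, and nulls are ignored. The dictionary-aware path evaluates the predicate once per distinct value and then maps keys to results, while the baseline unpacks first and evaluates the predicate once per row.

```rust
/// Hypothetical stand-in for a dictionary-encoded string column:
/// `keys[i]` is an index into `values`. This is NOT arrow-rs's `DictionaryArray`.
struct DictColumn {
    keys: Vec<u32>,
    values: Vec<String>,
}

impl DictColumn {
    /// Unpack into one `String` per row, as a reader would do when it
    /// does not preserve the dictionary.
    fn unpack(&self) -> Vec<String> {
        self.keys
            .iter()
            .map(|&k| self.values[k as usize].clone())
            .collect()
    }

    /// Dictionary-aware filter: run the predicate once per distinct value,
    /// then look up the per-row result through the keys.
    fn filter_mask<F: Fn(&str) -> bool>(&self, pred: F) -> Vec<bool> {
        let value_mask: Vec<bool> = self.values.iter().map(|v| pred(v.as_str())).collect();
        self.keys.iter().map(|&k| value_mask[k as usize]).collect()
    }
}

/// Baseline filter: run the predicate once per (already unpacked) row.
fn filter_mask_unpacked<F: Fn(&str) -> bool>(rows: &[String], pred: F) -> Vec<bool> {
    rows.iter().map(|r| pred(r.as_str())).collect()
}
```

Both paths produce the same mask; whether the dictionary-aware one is actually faster depends on the ratio of distinct values to rows and on the cost of the predicate, which is consistent with the mixed results described above.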

