You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/30 20:48:56 UTC

[GitHub] [arrow-rs] yordan-pavlov commented on issue #1111: ArrowArrayReader Reads Too Many Values From Bit-Packed Runs

yordan-pavlov commented on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003176898


   @tustvold 's work on making the "old" arrow reader faster looks promising, plus I've hit a wall in making the `ArrowArrayReader` faster in some cases, so I agree that it doesn't make sense to try to fix the `ArrowArrayReader` and instead it makes more sense to focus on finishing the work in #1082 .
   
   `ArrowArrayReader` does use def levels to calculate how many values to read, see here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L576 , but reading def levels is done independently of reading of values (one of the main characteristics of `ArrowArrayReader`) and so if the value of `num_values` field from `Page` is incorrect and does indeed include NULL values, then the entire idea of `ArrowArrayReader` falls apart and it won't work correctly.
   
   One reason why I implemented `ArrowArrayReader` in a new struct so that it can easily be swapped out if was necessary, so I guess that time has come. Yes - it will result in a significant reduction in performance of reading string arrays, but correctness is more important than speed and also it would be only until #1082 is finished.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org