
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/06 16:51:48 UTC

[GitHub] [arrow] zeroshade commented on pull request #13277: ARROW-16638: [Go][Parquet] Fix skipping large number of rows in boolean columns

zeroshade commented on PR #13277:
URL: https://github.com/apache/arrow/pull/13277#issuecomment-1147662934

   @mdepero In the `parquet/internal/utils` package there is a function, `BytesToBools`, that exists specifically to convert bitpacked bytes to a `[]bool` efficiently. It assumes that the slices are already sized appropriately (`len(out)` should equal `len(in)*8`).
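   
   To make the sizing assumption concrete, here is a minimal sketch of that kind of conversion. It is not the actual `BytesToBools` source: the name `bytesToBoolsSketch` is hypothetical, and the least-significant-bit-first order is an assumption made for the example.

```go
package main

import "fmt"

// bytesToBoolsSketch is a hypothetical stand-in, NOT the real
// parquet/internal/utils.BytesToBools. It unpacks bit-packed bytes into out,
// assuming least-significant-bit-first order and len(out) == len(in)*8.
func bytesToBoolsSketch(in []byte, out []bool) {
	for i, b := range in {
		for bit := 0; bit < 8; bit++ {
			// Test one bit of the byte and store it as a bool.
			out[i*8+bit] = b&(1<<bit) != 0
		}
	}
}

func main() {
	packed := []byte{0b00000101, 0b10000000}
	// The caller sizes the output up front: 8 bools per packed byte.
	out := make([]bool, len(packed)*8)
	bytesToBoolsSketch(packed, out)
	fmt.Println(out)
}
```

   The point is only that the caller owns both buffers, so their sizes have to agree before the call.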
   
   That being said, there might be a way we can get around this by changing the implementation slightly. 
   
   Since the `columnChunkReader` is embedded in the typed readers, we don't "technically" need to call `cr.columnChunkReader.skipValues(...` and could instead call `cr.skipValues(...`. I only wrote out the `cr.columnChunkReader` portion to make it explicit. The benefit of switching to plain `cr.skipValues` is that we can then override the `skipValues` function for the `BooleanColumnChunkReader` so that it allocates the *correct* amount of scratch space. That does result in some duplicated code, but I think it's the better solution because it avoids the extra allocation where possible. As another, forward-looking idea, I'd probably want the scratch space to come from a pool of buffers rather than allocating new scratch space for every skip, but that can be done as a later change. Anyway, did that all make sense as something you could do?
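   
   Roughly, the shape of that idea looks like the sketch below. The type names mirror the readers mentioned above, but the method bodies, the `Skip` signatures, and the scratch sizes (4 bytes per value, 1 byte per bool) are placeholders for illustration, not the real implementation.

```go
package main

import "fmt"

// columnChunkReader stands in for the shared base reader.
type columnChunkReader struct{}

// skipValues is the generic path. It returns the scratch size it would
// allocate; 4 bytes per value is a placeholder, not the real sizing.
func (c *columnChunkReader) skipValues(n int) int { return n * 4 }

// Int32ColumnChunkReader embeds the base and defines no skipValues of its own.
type Int32ColumnChunkReader struct{ columnChunkReader }

func (cr *Int32ColumnChunkReader) Skip(n int) int {
	return cr.skipValues(n) // promoted from the embedded columnChunkReader
}

// BooleanColumnChunkReader embeds the base but shadows skipValues.
type BooleanColumnChunkReader struct{ columnChunkReader }

// skipValues sizes scratch differently for booleans (placeholder: 1 per value).
func (cr *BooleanColumnChunkReader) skipValues(n int) int { return n }

func (cr *BooleanColumnChunkReader) Skip(n int) int {
	// The unqualified call resolves to the boolean-specific method.
	// Writing cr.columnChunkReader.skipValues(n) would bypass it.
	return cr.skipValues(n)
}

func main() {
	i32 := &Int32ColumnChunkReader{}
	b := &BooleanColumnChunkReader{}
	fmt.Println(i32.Skip(10), b.Skip(10), b.columnChunkReader.skipValues(10))
}
```

   One caveat: Go embedding is not virtual dispatch. The override is only picked up when the call goes through the outer type, so any call made from inside a `columnChunkReader` method would still reach the base `skipValues`.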
   
   Let me know if you have any questions. Thanks again for this!

