Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/04 13:28:52 UTC

[GitHub] [arrow-rs] zeevm opened a new issue #1270: Expose Dictionary to reader

zeevm opened a new issue #1270:
URL: https://github.com/apache/arrow-rs/issues/1270


   Many Parquet query engines have optimizations that rely on dictionary-encoded columns, e.g. for selections with a filter.
   
   The Rust implementation of the Parquet reader makes it difficult to read dictionary-encoded values because it doesn't expose the RLE decoder to reader code. A reader that wishes to work with dictionary values therefore has to re-implement an RLE decoder to read values from dictionary-encoded data pages.
   
   This can be easily addressed by making the RLE code public outside the crate.
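
For context on what re-implementing the decoder involves, here is a minimal sketch of the RLE/bit-packing hybrid encoding as described in the Parquet format specification. It is not the crate's internal decoder, and the function name `decode_rle_hybrid` and its signature are hypothetical. It assumes the caller has already read the bit width (the first byte of a dictionary-encoded data page) and passes only the bytes that follow it.

```rust
/// Decode `num_values` dictionary indices from an RLE/bit-packed hybrid buffer.
/// Returns `None` if the buffer is truncated or the bit width is unsupported.
/// Hypothetical helper for illustration; not part of the parquet crate's API.
pub fn decode_rle_hybrid(data: &[u8], bit_width: u8, num_values: usize) -> Option<Vec<u32>> {
    if bit_width > 32 {
        return None;
    }
    let width = usize::from(bit_width);
    let mut out: Vec<u32> = Vec::new();
    let mut pos = 0usize;

    while out.len() < num_values {
        // Each run starts with a ULEB128 varint header.
        let mut header: u64 = 0;
        let mut shift = 0u32;
        loop {
            let byte = *data.get(pos)?;
            pos += 1;
            header |= u64::from(byte & 0x7f) << shift;
            if byte & 0x80 == 0 {
                break;
            }
            shift += 7;
            if shift > 63 {
                return None;
            }
        }
        let payload = usize::try_from(header >> 1).ok()?;

        if header & 1 == 1 {
            // Bit-packed run: `payload` groups of 8 values, packed LSB first.
            let count = payload.checked_mul(8)?;
            let num_bytes = payload.checked_mul(width)?;
            let end = pos.checked_add(num_bytes)?;
            let bytes = data.get(pos..end)?;
            pos = end;
            for i in 0..count {
                if out.len() == num_values {
                    break; // the last group may be padded past num_values
                }
                let mut value: u32 = 0;
                for b in 0..width {
                    let idx = i * width + b;
                    let bit = (bytes[idx / 8] >> (idx % 8)) & 1;
                    value |= u32::from(bit) << b;
                }
                out.push(value);
            }
        } else {
            // RLE run: one little-endian value, padded to whole bytes,
            // repeated `payload` times.
            let value_bytes = (width + 7) / 8;
            let end = pos.checked_add(value_bytes)?;
            let bytes = data.get(pos..end)?;
            pos = end;
            let mut value: u32 = 0;
            for (i, b) in bytes.iter().enumerate() {
                value |= u32::from(*b) << (8 * i);
            }
            let take = payload.min(num_values - out.len());
            out.extend(std::iter::repeat(value).take(take));
        }
    }
    Some(out)
}
```

With the example from the format specification, `decode_rle_hybrid(&[0x03, 0x88, 0xC6, 0xFA], 3, 8)` yields the indices 0 through 7. Note that the caller must supply `num_values` from the page header, since a bit-packed run may carry padding values.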
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold edited a comment on issue #1270: Expose Dictionary to reader

Posted by GitBox <gi...@apache.org>.

tustvold edited a comment on issue #1270:
URL: https://github.com/apache/arrow-rs/issues/1270#issuecomment-1030579638


   Thank you for taking the time to respond, I figured that might be the case, but thought it couldn't hurt to check.
   
   I'm sure you're aware, but just as a heads up if you're reading the data directly, the RLE encoding is not length preserving (https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003119362), and a column chunk may not be consistently dictionary encoded (e.g. if the dictionary gets too large).
   
   FWIW there were some generics added in #1041 and evolved since to aid decoding columns to custom in-memory representations. They're currently crate-local, but that could be changed if you wished to use them. Just let me know 😀



[GitHub] [arrow-rs] tustvold edited a comment on issue #1270: Expose Dictionary to reader

Posted by GitBox <gi...@apache.org>.

tustvold edited a comment on issue #1270:
URL: https://github.com/apache/arrow-rs/issues/1270#issuecomment-1030323613


   I'd be very interested in any details you can share about your particular use-case, in particular whether there is any way we might be able to combine efforts in this space. The proposal in #1191 is just that, and any input you'd be willing to provide would be most appreciated :+1:
   
   If you're using arrow, I'd also potentially draw your attention to #1180, which will preserve the dictionary encoding present in the parquet file for dictionary arrays, and is slated for inclusion as the default behaviour in arrow 9.



[GitHub] [arrow-rs] tustvold commented on issue #1270: Expose Dictionary to reader

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #1270:
URL: https://github.com/apache/arrow-rs/issues/1270#issuecomment-1030579638


   Thank you for taking the time to respond, I figured that might be the case, but thought it couldn't hurt to check.
   
   As a heads up if you're reading the data directly, the RLE encoding is not length preserving (#1111), and there are some other shenanigans concerning dictionary spilling.
   
   FWIW there were some generics added in #1041 and evolved since to aid decoding columns to custom in-memory representations. They're currently crate-local, but that could be changed if you wished to use them. Just let me know 😀



[GitHub] [arrow-rs] tustvold edited a comment on issue #1270: Expose Dictionary to reader

Posted by GitBox <gi...@apache.org>.

tustvold edited a comment on issue #1270:
URL: https://github.com/apache/arrow-rs/issues/1270#issuecomment-1030323613


   I'd be very interested in any details you can share about your particular use-case, in particular whether there is any way we might be able to combine efforts in this space. The proposal in #1191 is just that, and any input you'd be willing to provide would be most appreciated :+1:
   
   I'd also potentially draw your attention to #1180, which, if you're using arrow, will preserve the dictionary encoding present in the parquet file, and is slated for inclusion as the default behaviour in arrow 9.



[GitHub] [arrow-rs] tustvold commented on issue #1270: Expose Dictionary to reader

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #1270:
URL: https://github.com/apache/arrow-rs/issues/1270#issuecomment-1030323613


   I'd be very interested in any details you can share about your particular use-case, in particular whether there is any way we might be able to combine efforts in this space. The proposal in #1191 is just that, and any input you'd be willing to provide would be most appreciated :+1:
   
   I'd also potentially draw your attention to #1180, which, if you're using arrow, provides a mechanism to extract data while preserving the dictionary encoding present in the parquet file, and is slated for inclusion in arrow 9.



[GitHub] [arrow-rs] tustvold edited a comment on issue #1270: Expose Dictionary to reader

Posted by GitBox <gi...@apache.org>.

tustvold edited a comment on issue #1270:
URL: https://github.com/apache/arrow-rs/issues/1270#issuecomment-1030579638


   Thank you for taking the time to respond, I figured that might be the case, but thought it couldn't hurt to check.
   
   I'm sure you're aware, but just as a heads up if you're reading the data directly, the RLE encoding is not length preserving (#1111), and a column chunk may not be consistently dictionary encoded (e.g. if the dictionary gets too large).
   
   FWIW there were some generics added in #1041 and evolved since to aid decoding columns to custom in-memory representations. They're currently crate-local, but that could be changed if you wished to use them. Just let me know 😀



[GitHub] [arrow-rs] zeevm commented on issue #1270: Expose Dictionary to reader

Posted by GitBox <gi...@apache.org>.

zeevm commented on issue #1270:
URL: https://github.com/apache/arrow-rs/issues/1270#issuecomment-1030600067


   @tustvold My engine assumes a column is either fully dictionary encoded or not. For my use case I first have to scan the headers of all pages in a column chunk to assert they're all dictionary encoded. If any of them is not (other than the dictionary page itself, of course), I treat the column as not dictionary encoded, meaning I'll read it with a ColumnReader instead of a PageReader and let the library handle the variously encoded pages.
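
The check described above can be sketched roughly as follows. The types here (`Encoding`, `PageKind`, `PageHeader`) are hypothetical stand-ins defined locally for illustration, not the parquet crate's own types; the only fact assumed from the Parquet format is that dictionary-encoded data pages are marked with either `PLAIN_DICTIONARY` or `RLE_DICTIONARY`.

```rust
/// Hypothetical, simplified mirror of the encodings recorded in page headers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Plain,
    PlainDictionary,
    RleDictionary,
    DeltaBinaryPacked,
}

/// Hypothetical page classification: the dictionary page versus data pages.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageKind {
    Dictionary,
    Data,
}

/// Hypothetical summary of one page header in a column chunk.
#[derive(Debug, Clone, Copy)]
pub struct PageHeader {
    pub kind: PageKind,
    pub encoding: Encoding,
}

/// Returns true only if the chunk has a dictionary page and every data page
/// is dictionary encoded. A single fallback page (for example a plain page
/// written after the dictionary grew too large) makes the whole chunk
/// count as not dictionary encoded.
pub fn is_fully_dictionary_encoded(headers: &[PageHeader]) -> bool {
    let mut saw_dictionary_page = false;
    for header in headers {
        match header.kind {
            // The dictionary page itself is exempt from the encoding check.
            PageKind::Dictionary => saw_dictionary_page = true,
            PageKind::Data => {
                let dict_encoded = matches!(
                    header.encoding,
                    Encoding::PlainDictionary | Encoding::RleDictionary
                );
                if !dict_encoded {
                    return false;
                }
            }
        }
    }
    saw_dictionary_page
}
```

A caller would take the dictionary-aware path only when this returns true, and otherwise fall back to letting the library decode the pages.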



[GitHub] [arrow-rs] alamb commented on issue #1270: Expose Dictionary to reader

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1270:
URL: https://github.com/apache/arrow-rs/issues/1270#issuecomment-1030298354


   FYI I think @tustvold has some plans to contribute functionality that may be similar to the parquet crate directly in https://github.com/apache/arrow-rs/issues/1191



[GitHub] [arrow-rs] sunchao closed issue #1270: Expose Dictionary to reader

Posted by GitBox <gi...@apache.org>.

sunchao closed issue #1270:
URL: https://github.com/apache/arrow-rs/issues/1270


   



[GitHub] [arrow-rs] tustvold edited a comment on issue #1270: Expose Dictionary to reader

Posted by GitBox <gi...@apache.org>.

tustvold edited a comment on issue #1270:
URL: https://github.com/apache/arrow-rs/issues/1270#issuecomment-1030579638


   Thank you for taking the time to respond, I figured that might be the case, but thought it couldn't hurt to check.
   
   I'm sure you're aware, but just as a heads up if you're reading the data directly, the RLE encoding is not length preserving (https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003119362), and a column chunk may not be consistently dictionary encoded (e.g. if the dictionary gets too large).
   
   FWIW there were some generics added in #1041 and evolved since to aid decoding columns to custom in-memory representations. They're currently crate-local, but that could be changed if you wished to use them. Just let me know 😀



[GitHub] [arrow-rs] zeevm commented on issue #1270: Expose Dictionary to reader

Posted by GitBox <gi...@apache.org>.

zeevm commented on issue #1270:
URL: https://github.com/apache/arrow-rs/issues/1270#issuecomment-1030575351


   @alamb @tustvold My use case is a proprietary analytical DB engine. It has its own proprietary storage format but also allows running queries against external formats like Parquet.
   
   As it already has a highly optimized scan capability for dictionary-encoded data, all I want is for it to have access to the raw Parquet dictionary.
   
   I don't want to take a dependency on Arrow arrays for that, as I'm not using Arrow at all. I don't deserialize Parquet into Arrow, since the engine I'm working on has its own in-memory representation (I don't even build Arrow with Parquet).

