Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/16 11:41:23 UTC

[GitHub] [arrow-rs] alamb opened a new issue, #1705: Parquet: Create PageIndex on write

alamb opened a new issue, #1705:
URL: https://github.com/apache/arrow-rs/issues/1705

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Suggested by @ParadoxShmaradox on #1191
   
   In analytics systems, Parquet files are passed between different systems, so the more metadata the Rust implementation of Parquet writes, the better other systems may be able to handle those files.
   
   One such type of metadata is `PageIndex` https://github.com/apache/parquet-format/blob/master/PageIndex.md
   
   Specifically, I believe the page index consists of the `ColumnIndex` and `OffsetIndex` structures in parquet-format:
   
   https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L938-L971
   
   https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L926-L932
   
   These structures are used to speed up queries in certain systems; for example, see https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
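
As a rough illustration of why a page index helps, here is a minimal sketch of page pruning. The `PageStats` and `PageLocation` structs and the `pages_to_read` function are hypothetical, simplified stand-ins: they are not the parquet-rs API, and the real `ColumnIndex` stores min/max values as binary alongside null information. The sketch assumes a single `i64` column and a range predicate.

```rust
/// Hypothetical, simplified stand-in for one page's entry in a `ColumnIndex`.
#[derive(Debug, Clone, PartialEq)]
pub struct PageStats {
    pub min: i64,
    pub max: i64,
}

/// Hypothetical, simplified stand-in for one page's entry in an `OffsetIndex`.
#[derive(Debug, Clone, PartialEq)]
pub struct PageLocation {
    /// Byte offset of the page within the file.
    pub offset: u64,
    /// Size of the compressed page in bytes.
    pub compressed_size: u32,
    /// Index of the page's first row within the row group.
    pub first_row_index: u64,
}

/// Returns the locations of the pages whose [min, max] range overlaps the
/// query range [lo, hi]. Pages that cannot match never need to be fetched.
pub fn pages_to_read(
    stats: &[PageStats],
    locations: &[PageLocation],
    lo: i64,
    hi: i64,
) -> Vec<PageLocation> {
    stats
        .iter()
        .zip(locations.iter())
        // Keep a page only if its value range intersects [lo, hi].
        .filter(|(s, _)| s.max >= lo && s.min <= hi)
        .map(|(_, loc)| loc.clone())
        .collect()
}
```

A reader would evaluate the predicate against the per-page statistics first, then issue IO only for the returned byte ranges.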
   
   
   **Describe the solution you'd like**
   1. Write `PageIndex` structures into files created by parquet-rs (specif
   2. Provide access to reading `PageIndex` from the parquet metadata structures
   
   
   Note that if creating the `PageIndex` structures is too expensive, their creation could be optional and controlled via options.
   
   
   
   
   **Additional context**
   Original request here https://github.com/apache/arrow-rs/issues/1191#issuecomment-1126814290
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #1705: Parquet: Read/write support for PageIndex (ColumnIndex / OffsetIndex)

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1705:
URL: https://github.com/apache/arrow-rs/issues/1705#issuecomment-1220752285

   I think this is now complete




[GitHub] [arrow-rs] Ted-Jiang commented on issue #1705: Parquet: Read/write support for PageIndex (ColumnIndex / OffsetIndex)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1705:
URL: https://github.com/apache/arrow-rs/issues/1705#issuecomment-1135415886

   @alamb @tustvold 
   Our team is building a query engine that is sensitive to time-to-first-result latency (the bottleneck is scan and aggregation), so we want to support PageIndex in arrow-rs to avoid unnecessary IO. I will follow up on the earlier discussions and try to put together a design for this (based on parquet-mr). If you can give any advice, I would be grateful ❤️.




[GitHub] [arrow-rs] tustvold commented on issue #1705: Parquet: Read/write support for PageIndex (ColumnIndex / OffsetIndex)

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1705:
URL: https://github.com/apache/arrow-rs/issues/1705#issuecomment-1135450037

   Hi @Ted-Jiang, I think the first step will be to support reading and writing the page index. Once that is done, we can look to include support in #1605. I'm currently actively working on object store IO, so if you wanted to handle the index encode/decode we can probably parallelize this effort 😃




[GitHub] [arrow-rs] alamb commented on issue #1705: Parquet: Read/write support for PageIndex (ColumnIndex / OffsetIndex)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1705:
URL: https://github.com/apache/arrow-rs/issues/1705#issuecomment-1135945770

   I think parallelizing the implementation (as @tustvold suggests, by splitting out the encode/decode of the page index) would be great.
   
   If creating the page index on write consumes too much time, perhaps we can make writing it optional (e.g. via https://docs.rs/parquet/14.0.0/parquet/file/properties/struct.WriterProperties.html). The same could be done for reading the page index if it is too expensive.
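
To make the "optional" idea concrete, here is a minimal sketch of an opt-in flag on a builder. Everything in it is hypothetical: `SketchWriterOptions`, `set_write_page_index`, and `build_page_index` are illustrative names, not part of `WriterProperties` or any other parquet-rs type, and defaulting the flag to `false` is an arbitrary assumption.

```rust
/// Hypothetical options type; NOT the real `WriterProperties`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SketchWriterOptions {
    write_page_index: bool,
}

/// Hypothetical builder; the flag defaults to `false` (an assumption).
#[derive(Debug, Default)]
pub struct SketchWriterOptionsBuilder {
    write_page_index: bool,
}

impl SketchWriterOptions {
    pub fn builder() -> SketchWriterOptionsBuilder {
        SketchWriterOptionsBuilder::default()
    }

    pub fn write_page_index(&self) -> bool {
        self.write_page_index
    }
}

impl SketchWriterOptionsBuilder {
    /// Opt in to (or out of) computing the page index on write.
    pub fn set_write_page_index(mut self, enabled: bool) -> Self {
        self.write_page_index = enabled;
        self
    }

    pub fn build(self) -> SketchWriterOptions {
        SketchWriterOptions {
            write_page_index: self.write_page_index,
        }
    }
}

/// Computes per-page (min, max) pairs only when the option is enabled,
/// so writers that do not want the index skip the work entirely.
/// Empty pages are skipped in this sketch.
pub fn build_page_index(
    opts: &SketchWriterOptions,
    pages: &[Vec<i64>],
) -> Option<Vec<(i64, i64)>> {
    if !opts.write_page_index() {
        return None;
    }
    Some(
        pages
            .iter()
            .filter_map(|page| {
                let min = page.iter().copied().min()?;
                let max = page.iter().copied().max()?;
                Some((min, max))
            })
            .collect(),
    )
}
```

The same pattern could gate the read side: a reader option that decides whether the index is decoded at all.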




[GitHub] [arrow-rs] Ted-Jiang commented on issue #1705: Parquet: Read/write support for PageIndex (ColumnIndex / OffsetIndex)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on issue #1705:
URL: https://github.com/apache/arrow-rs/issues/1705#issuecomment-1135454503

   > Hi @Ted-Jiang, I think the first step will be to support reading and writing the page index. Once that is done, we can look to include support in #1605. I'm currently actively working on object store IO, so if you wanted to handle the index encode/decode we can probably parallelize this effort 😃
   
   Sure! 💪 I would prefer to support reading first (following the Java version) and then check the effect on performance.




[GitHub] [arrow-rs] tustvold closed issue #1705: Parquet: Read/write support for PageIndex (ColumnIndex / OffsetIndex)

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #1705: Parquet: Read/write support for PageIndex (ColumnIndex / OffsetIndex)
URL: https://github.com/apache/arrow-rs/issues/1705

