Posted to github@arrow.apache.org by "etseidl (via GitHub)" <gi...@apache.org> on 2023/02/10 20:13:18 UTC

[GitHub] [arrow] etseidl commented on pull request #34054: GH-34053: [C++][Parquet] Write parquet page index

etseidl commented on PR #34054:
URL: https://github.com/apache/arrow/pull/34054#issuecomment-1426288326

   @wgtmac Another issue to consider as you implement the page index: rows that span multiple pages. With nested columns, a single row can be large enough to exceed the requested page size. arrow-cpp currently honors the page size by splitting such rows across multiple pages. The current Parquet spec, however, seems to require that pages begin at row boundaries (i.e. the repetition level R is 0 for the first value in each page; see [here](https://github.com/apache/parquet-format/blob/5205dc7b7c0b910ea6af33cadbd2963c0c47c726/src/main/thrift/parquet.thrift#L564) and [here](https://github.com/apache/parquet-format/blob/5205dc7b7c0b910ea6af33cadbd2963c0c47c726/src/main/thrift/parquet.thrift#L918)). Do you concur, and do you think this should be another blocking issue or be handled as part of this PR?
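
To make the concern concrete, here is a rough sketch of the difference between splitting strictly at the size limit and splitting only where R is 0. It is not arrow-cpp code: the helper names (`SplitEveryN`, `SplitAtRowBoundaries`, `PagesStartAtRowBoundaries`) are hypothetical, and the page limit is expressed as a number of values rather than bytes purely to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an arrow-cpp API): start a new page strictly every
// max_values_per_page values, ignoring row boundaries.
std::vector<std::size_t> SplitEveryN(std::size_t num_values,
                                     std::size_t max_values_per_page) {
  assert(max_values_per_page > 0);
  std::vector<std::size_t> page_starts;
  for (std::size_t i = 0; i < num_values; i += max_values_per_page) {
    page_starts.push_back(i);
  }
  return page_starts;
}

// Hypothetical helper: treat max_values_per_page as a soft limit and only
// start a new page at a value whose repetition level is 0 (a new row).
// A page may therefore exceed the limit when a single row is very large.
std::vector<std::size_t> SplitAtRowBoundaries(
    const std::vector<int16_t>& rep_levels, std::size_t max_values_per_page) {
  assert(max_values_per_page > 0);
  std::vector<std::size_t> page_starts;
  if (rep_levels.empty()) return page_starts;
  page_starts.push_back(0);
  for (std::size_t i = 1; i < rep_levels.size(); ++i) {
    const bool row_start = rep_levels[i] == 0;
    const bool page_full = i - page_starts.back() >= max_values_per_page;
    if (row_start && page_full) page_starts.push_back(i);
  }
  return page_starts;
}

// Hypothetical check: true if every page begins on a value with R == 0.
bool PagesStartAtRowBoundaries(const std::vector<int16_t>& rep_levels,
                               const std::vector<std::size_t>& page_starts) {
  for (std::size_t start : page_starts) {
    if (start >= rep_levels.size() || rep_levels[start] != 0) return false;
  }
  return true;
}
```

For repetition levels `{0, 1, 1, 1, 1, 1, 0, 1, 0}` (rows of 6, 2, and 1 values) and a limit of 4 values, the strict split starts a page at offset 4, in the middle of the first row, while the row-aligned split lets the first page grow to 6 values so the next page starts on a row boundary.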

