Posted to dev@parquet.apache.org by "Xinli Shang (Jira)" <ji...@apache.org> on 2022/02/02 16:40:00 UTC
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485949#comment-17485949 ]
Xinli Shang commented on PARQUET-2117:
--------------------------------------
Thanks for opening this Jira! Look forward to the PR.
> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
> Key: PARQUET-2117
> URL: https://issues.apache.org/jira/browse/PARQUET-2117
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Prakhar Jain
> Priority: Major
> Fix For: 1.13.0
>
>
> Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read a Parquet file in columnar fashion or record by record.
> It would be great to extend them to also support a rowPosition API that reports the position of the current record in the Parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This can be useful for creating an index (e.g. a B+ tree) over a Parquet file or Parquet table (e.g. in Spark/Hive).
> Multiple projects in the Parquet ecosystem can benefit from such functionality:
> # Apache Iceberg needs this functionality. It already implements it on top of low-level Parquet APIs - [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171], [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
> # Apache Spark can use this functionality - SPARK-37980
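For illustration only, here is a minimal sketch of how a row position like the one described above could be derived from per-row-group row counts. It does not use parquet-mr at all and is not the proposed API: the class RowPositionSketch and the methods rowGroupStartPositions and rowPosition are hypothetical names, and the row counts are made-up inputs standing in for what a file footer would supply.

```java
public class RowPositionSketch {

    // Hypothetical helper: given the number of rows in each row group (in file
    // order), return the file-level position of the first row of each group.
    public static long[] rowGroupStartPositions(long[] rowCounts) {
        long[] starts = new long[rowCounts.length];
        long runningTotal = 0L;
        for (int i = 0; i < rowCounts.length; i++) {
            starts[i] = runningTotal;      // rows that precede this row group
            runningTotal += rowCounts[i];
        }
        return starts;
    }

    // Hypothetical helper: position of a record in the file, given the row
    // group it belongs to and its zero-based index inside that row group.
    public static long rowPosition(long[] starts, int rowGroupIndex, long indexInGroup) {
        return starts[rowGroupIndex] + indexInGroup;
    }

    public static void main(String[] args) {
        // Made-up row counts for three row groups.
        long[] starts = rowGroupStartPositions(new long[] {100L, 250L, 50L});
        // Eighth record (index 7) of the second row group.
        System.out.println(rowPosition(starts, 1, 7L));
    }
}
```

Under these assumptions the position stays stable even when whole row groups are skipped by a filter, because each group's start is computed from the counts of all preceding groups rather than from the number of records actually read.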
--
This message was sent by Atlassian Jira
(v8.20.1#820001)