Posted to dev@parquet.apache.org by "Xinli Shang (Jira)" <ji...@apache.org> on 2022/02/02 16:40:00 UTC

[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

    [ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485949#comment-17485949 ] 

Xinli Shang commented on PARQUET-2117:
--------------------------------------

Thanks for opening this Jira! Look forward to the PR.

> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
>                 Key: PARQUET-2117
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2117
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Prakhar Jain
>            Priority: Major
>             Fix For: 1.13.0
>
>
> Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read a Parquet file in columnar fashion or record by record.
> It would be great to extend them to also support a rowPosition API that reports the position of the current record in the Parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This can be useful for creating an index (e.g. a B+ tree) over a Parquet file or a Parquet table (e.g. in Spark/Hive).
> There are multiple projects in the Parquet ecosystem that can benefit from such functionality:
>  # Apache Iceberg needs this functionality. It already has its own implementation because it relies on low-level Parquet APIs: [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171], [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
>  # Apache Spark can use this functionality - SPARK-37980
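To illustrate the idea in the description, here is a minimal sketch of how a file-level row position could be derived. It assumes the reader knows the row count of every row group (Parquet records this in the file footer) and the index of the current record within its row group. The class and method names below are hypothetical and are not part of parquet-mr; the sketch only shows the arithmetic, not the proposed API.

```java
// Hypothetical sketch: none of these names come from parquet-mr.
final class RowPositionSketch {

    /**
     * Computes the file-level index of the first row in each row group
     * as a running sum of the preceding row groups' row counts.
     */
    static long[] rowGroupStartOffsets(long[] rowCounts) {
        long[] starts = new long[rowCounts.length];
        long running = 0;
        for (int i = 0; i < rowCounts.length; i++) {
            starts[i] = running;
            running += rowCounts[i];
        }
        return starts;
    }

    /**
     * File-level position of a record: the start offset of its row group
     * plus its zero-based index inside that row group.
     */
    static long rowPosition(long[] starts, int rowGroupIndex, long indexInRowGroup) {
        return starts[rowGroupIndex] + indexInRowGroup;
    }

    public static void main(String[] args) {
        // Assumed example: a file with three row groups of 100, 50 and 75 rows.
        long[] starts = rowGroupStartOffsets(new long[] {100, 50, 75});

        // The record at index 10 of the second row group.
        System.out.println(rowPosition(starts, 1, 10)); // prints 110
    }
}
```

Deriving the position from row-group offsets, rather than counting the records handed to the caller, keeps the value stable when a filter skips whole row groups: the record above is at position 110 whether or not the first row group was read.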


