Posted to issues@spark.apache.org by "Cheng Lian (Jira)" <ji...@apache.org> on 2022/02/01 12:24:00 UTC

[jira] [Commented] (SPARK-37980) Extend METADATA column to support row indices for file based data sources

    [ https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485215#comment-17485215 ] 

Cheng Lian commented on SPARK-37980:
------------------------------------

[~prakharjain09], as you mentioned, customizing the Parquet code paths in Spark to achieve this is not straightforward. At the same time, the functionality is quite useful in general: I can imagine it enabling other systems in the Parquet ecosystem to build more sophisticated indexing solutions. Instead of making heavy customizations in Spark, would it be better to make the change upstream in {{parquet-mr}}, so that other systems can benefit from it more easily?

> Extend METADATA column to support row indices for file based data sources
> -------------------------------------------------------------------------
>
>                 Key: SPARK-37980
>                 URL: https://issues.apache.org/jira/browse/SPARK-37980
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3
>            Reporter: Prakhar Jain
>            Priority: Major
>
> Spark recently added hidden metadata column support for file-based data sources as part of SPARK-37273.
> We should extend it to also support ROW_INDEX/ROW_POSITION.
>  
> Meaning of ROW_POSITION:
> ROW_INDEX/ROW_POSITION is the index of a row within a file, e.g. the 5th row in a file will have ROW_INDEX 5.
>  
> Use cases:
> Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple uniquely identifies a row in a table. This information can be used to mark rows, for example by an indexer.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
