Posted to issues@spark.apache.org by "Wenchen Fan (Jira)" <ji...@apache.org> on 2022/01/24 03:49:00 UTC

[jira] [Commented] (SPARK-37980) Extend METADATA column to support row indices for file based data sources

    [ https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480811#comment-17480811 ] 

Wenchen Fan commented on SPARK-37980:
-------------------------------------

I think this will be a useful feature that supports more use cases in the future. I'm not sure how it relates to the DS v2 index work, but a unique row identifier can help to build row-level indexes such as a B-tree.

I think the key here is a file-level row index. I don't think we can implement a reliable table-level row index with a file source, and the current way of generating row numbers with expressions may return inconsistent results because of filter pushdown (the result differs depending on whether filter pushdown is turned on or off).
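To illustrate the pushdown concern, here is a minimal pure-Python sketch, not Spark code. It assumes one way the discrepancy can arise: a counter that runs on the scan output sees fewer rows once the reader drops rows itself. The names `scan`, `number_after_scan` and `file_row_index` are hypothetical and exist only for this illustration.

```python
# Hypothetical in-memory stand-in for a single data file; not a Spark API.
rows = [{"id": i, "value": v} for i, v in enumerate([10, 25, 7, 40, 3])]


def scan(file_rows, pushed_filter=None):
    """Simulate a reader that may drop rows before the engine sees them.

    Yields (position_in_file, row) so the file-level position survives.
    """
    for position, row in enumerate(file_rows):
        if pushed_filter is None or pushed_filter(row):
            yield position, row


def number_after_scan(file_rows, predicate, pushdown):
    """Number rows with a counter evaluated on the scan output."""
    scanned = scan(file_rows, predicate if pushdown else None)
    numbered = [(n, row) for n, (_, row) in enumerate(scanned)]
    # The engine-side filter is still applied after numbering.
    return [(n, row["id"]) for n, row in numbered if predicate(row)]


def file_row_index(file_rows, predicate, pushdown):
    """Report the position recorded by the reader itself."""
    scanned = scan(file_rows, predicate if pushdown else None)
    return [(pos, row["id"]) for pos, row in scanned if predicate(row)]


def at_least_ten(row):
    return row["value"] >= 10


# Counter on scan output: the row with id 3 gets a different number.
counter_off = number_after_scan(rows, at_least_ten, pushdown=False)
counter_on = number_after_scan(rows, at_least_ten, pushdown=True)

# Reader-provided position: identical with or without pushdown.
index_off = file_row_index(rows, at_least_ten, pushdown=False)
index_on = file_row_index(rows, at_least_ten, pushdown=True)
```

In this sketch the counter-based numbering changes with the pushdown setting, while the position reported by the reader stays the same, which is why the index would need to come from the file reader.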

[~prakharjain09] are we going to implement this feature in the underlying data sources, such as Parquet and ORC?

> Extend METADATA column to support row indices for file based data sources
> -------------------------------------------------------------------------
>
>                 Key: SPARK-37980
>                 URL: https://issues.apache.org/jira/browse/SPARK-37980
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3
>            Reporter: Prakhar Jain
>            Priority: Major
>
> Spark recently added hidden metadata column support for file-based data sources as part of SPARK-37273.
> We should extend it to also support ROW_INDEX/ROW_POSITION.
>  
> Meaning of ROW_POSITION:
> ROW_INDEX/ROW_POSITION is the index of a row within a file. E.g. the 5th row in the file will have ROW_INDEX 5.
>  
> Use cases:
> Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple uniquely identifies a row in a table. This information can be used to mark rows, e.g. by an indexer.
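
As a rough illustration of the (fileName, rowIndex) idea in the description above, the following pure-Python sketch keys every row by its file name and position, then uses those keys to mark rows. The file names, the `identify_rows` helper and the zero-based positions are assumptions made for this example; the issue text does not settle the index base.

```python
# Hypothetical table made of two files; names and contents are made up.
files = {
    "part-00000.parquet": ["a", "b", "c"],
    "part-00001.parquet": ["d", "e"],
}


def identify_rows(table_files):
    """Map each (file_name, row_index) pair to its row value."""
    return {
        (name, index): value
        for name in sorted(table_files)
        for index, value in enumerate(table_files[name])
    }


row_by_id = identify_rows(files)

# Mark two rows by identifier, e.g. as an indexer or delete marker might.
marked = {("part-00000.parquet", 1), ("part-00001.parquet", 0)}

# Rows that were not marked, in file order.
remaining = [value for row_id, value in row_by_id.items() if row_id not in marked]
```

The row index alone repeats across files (both files have a row 0), so only the pair acts as a table-wide identifier.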



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
