
Posted to issues@spark.apache.org by "Jackie Zhang (Jira)" <ji...@apache.org> on 2022/02/03 00:40:00 UTC

[jira] [Created] (SPARK-38094) Parquet: enable matching schema columns by field id

Jackie Zhang created SPARK-38094:
------------------------------------

             Summary: Parquet: enable matching schema columns by field id
                 Key: SPARK-38094
                 URL: https://issues.apache.org/jira/browse/SPARK-38094
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 3.3
            Reporter: Jackie Zhang


Field ID is a native field in the Parquet schema ([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398]).

After this PR, when the requested schema has field IDs, Parquet readers will first use the field ID to determine which Parquet columns to read, then fall back to matching by column name as before. This enables matching columns by field ID for data warehouse systems that support it, such as Iceberg and Delta.
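
The lookup order described above can be illustrated with a minimal, standalone sketch. It is not the Spark reader code: `Field` and `resolve_columns` are hypothetical names used only for illustration. It also assumes that a requested field carrying an ID is matched by ID only (no match yields None) and that name matching applies when the requested field has no ID; the description does not spell out the unmatched-ID case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Field:
    """Simplified column descriptor (hypothetical, not a Spark or Parquet class)."""
    name: str
    field_id: Optional[int] = None


def resolve_columns(requested, file_columns):
    """Map each requested field name to the matching file column, or None."""
    by_id = {c.field_id: c for c in file_columns if c.field_id is not None}
    by_name = {c.name: c for c in file_columns}
    resolved = {}
    for want in requested:
        if want.field_id is not None:
            # Assumption: an ID that is absent from the file yields no match.
            resolved[want.name] = by_id.get(want.field_id)
        else:
            # No ID on the requested field: match by column name as before.
            resolved[want.name] = by_name.get(want.name)
    return resolved


# The file stores "user_id" with ID 1; the reader asks for it under a new name.
file_columns = [Field("user_id", 1), Field("amount", 2)]
requested = [Field("customer_id", 1), Field("amount")]
resolved = resolve_columns(requested, file_columns)
print({k: (v.name if v else None) for k, v in resolved.items()})
# prints {'customer_id': 'user_id', 'amount': 'amount'}
```

In this sketch a renamed column is still found through its ID, which is the case that name-based matching alone cannot handle.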

This PR supports:
 * OSS vectorized reader

It does not support:
 * Parquet-mr reader, due to lack of field ID support (needs a follow-up ticket)


