Posted to dev@flink.apache.org by sunshun18 <su...@126.com> on 2022/12/05 03:54:38 UTC

Patch to support Parquet schema evolution

Hi there,


I found a null-value issue when using Flink to read Parquet files written with multiple versions of a schema (V1->V2->V3->..->Vn).
Assume there are two fields in the given Parquet schema as below, and field F2 only exists in version 2.


Version1: F1
Version2: F1, F2


Currently, the value of field F2 comes back empty when reading data from the Parquet files using schema Version2.
I looked into the implementation and found that Flink uses a collection named `unknownFieldsIndices` to track the nonexistent fields, and that this collection is applied to all Parquet files under the given path.
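
To illustrate the kind of behavior I mean, here is a minimal standalone sketch. It is not Flink's actual code: the class `SchemaEvolutionSketch` and the helpers `missingIndices` and `readRow` are hypothetical names made up for this mail, and it assumes the set of missing-field indices is computed from one file (a Version1 file here) and then reused for the others. Under that assumption, F2 is dropped even for a Version2 file that does contain it, while computing the set per file keeps the value.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SchemaEvolutionSketch {

    // Hypothetical helper: indices of requested fields that are absent
    // from the schema of one particular file.
    static Set<Integer> missingIndices(List<String> requested, List<String> fileSchema) {
        Set<Integer> missing = new HashSet<>();
        for (int i = 0; i < requested.size(); i++) {
            if (!fileSchema.contains(requested.get(i))) {
                missing.add(i);
            }
        }
        return missing;
    }

    // Hypothetical helper: project one row onto the requested fields.
    // Any field whose index is flagged as missing is returned as null.
    static List<String> readRow(List<String> requested,
                                Map<String, String> row,
                                Set<Integer> missing) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < requested.size(); i++) {
            out.add(missing.contains(i) ? null : row.get(requested.get(i)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> requested = List.of("F1", "F2");   // read with Version2
        List<String> version1 = List.of("F1");          // schema of an old file
        List<String> version2 = List.of("F1", "F2");    // schema of a new file
        Map<String, String> version2Row = Map.of("F1", "a", "F2", "b");

        // Assumption: the set is computed once (from a Version1 file)
        // and reused for every file under the path.
        Set<Integer> shared = missingIndices(requested, version1);
        System.out.println(readRow(requested, version2Row, shared));   // [a, null]

        // Computing the set against each file's own schema keeps F2.
        Set<Integer> perFile = missingIndices(requested, version2);
        System.out.println(readRow(requested, version2Row, perFile));  // [a, b]
    }
}
```

The first line printed shows F2 lost for a row that actually has it; the second shows the value preserved once the missing-field set is tied to the individual file.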


I drafted a patch to fix this issue, with a unit test.


https://issues.apache.org/jira/browse/FLINK-29527
https://github.com/apache/flink/pull/21149


As this PR has been pending for a long time, I hope a committer can help review it and provide feedback if possible.


Thanks!
Shun

Re: Patch to support Parquet schema evolution

Posted by yuxia <lu...@alumni.sjtu.edu.cn>.
Hi, Shun.
Thanks for the contribution. I'll take a look first and then find some committers to help review and merge it.

Best regards,
Yuxia