Posted to dev@flink.apache.org by Etienne Chauchot <ec...@apache.org> on 2021/03/11 16:17:11 UTC

Re: [Parquet support]

Hi all,

I just submitted another parquet PR that adds ParquetAvroInputFormat 
(I'm using it in a benchmark I'm coding). If anyone is interested in 
reviewing it, be my guest:

https://github.com/apache/flink/pull/15156

I also have an older parquet PR, fixing a format conversion bug, that is 
waiting to be merged, if anyone can review it as well (it already has 1 
approval from a non-committer, thanks @HuangZhenQiu <https://github.com/HuangZhenQiu>):

https://github.com/apache/flink/pull/14961

If I have time, I'll also tackle the other parquet tickets that I opened 
recently.

Best

Etienne

On 25/02/2021 08:34, Jingsong Li wrote:
> Hi Etienne,
>
> ParquetColumnarRowInputFormat is not fully functional yet: it has good
> performance, but it is hard to support complex types, like array and map.
> So I think a migrated ParquetInputFormat version is required.
>
> Best,
> Jingsong
>
> On Wed, Feb 24, 2021 at 3:43 PM Etienne Chauchot <ec...@apache.org>
> wrote:
>
>> Hi,
>>
>> Thanks guys for the comments !
>>
>> I did not know it was legacy. I will give the new sources a try.
>>
>> Jingsong, when you say "migrate ParquetInputFormat to the new BulkFormat
>> interface", do you mean that the new ParquetColumnarRowInputFormat is
>> not fully functional yet?
>>
>> In the meantime, if you agree, I think I'm still gonna submit a PR for
>> https://issues.apache.org/jira/browse/FLINK-21393 because I need it on
>> an urgent task I'm doing.
>>
>> Best
>>
>> Etienne
>>
>> On 24/02/2021 03:41, Peter Huang wrote:
>>> Hi Jingsong,
>>>
>>> Thanks for pointing this out. Actually, I planned to work on changing
>>> interfaces ParquetTableSource and ParquetInputFormat.
>>> After refactoring the code, I may also help to fix the issue in
>>> https://issues.apache.org/jira/browse/FLINK-21468.
>>>
>>> Best Regards
>>> Peter Huang
>>>
>>> On Tue, Feb 23, 2021 at 6:35 PM Jingsong Li <ji...@gmail.com> wrote:
>>>> Hi Etienne,
>>>>
>>>> Thanks for reporting.
>>>>
>>>> There are indeed many problems. There is no doubt that we need to improve
>>>> our current format implementation.
>>>>
>>>> But ParquetTableSource and ParquetInputFormat are legacy implementations
>>>> with legacy interfaces. We have introduced new interfaces for execution and
>>>> SQL. You can see:
>>>> - ParquetColumnarRowInputFormat with the BulkFormat interface. It is only
>>>> for columnar row reading and does not support complex types; we need to
>>>> migrate ParquetInputFormat to the new BulkFormat interface.
>>>> - FileSystemTableSource with the DynamicTableSource interface. It is a
>>>> generic FileSystem source for all formats; we can just use it for parquet too.
>>>>
>>>> Considering ParquetTableSource and ParquetInputFormat are legacy
>>>> interfaces, I think we can finish the migration work first. What do you think?
>>>>
>>>> Best,
>>>> Jingsong
>>>>
>>>> On Wed, Feb 24, 2021 at 12:46 AM Etienne Chauchot <echauchot@apache.org>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I've been playing with Parquet with SQL and Avro lately. I've found some
>>>>> bugs:
>>>>>
>>>>> 1. https://issues.apache.org/jira/browse/FLINK-21388 : I already
>>>>> submitted a PR on this one (https://github.com/apache/flink/pull/14961)
>>>>> 2. https://issues.apache.org/jira/browse/FLINK-21389
>>>>>
>>>>> 3. https://issues.apache.org/jira/browse/FLINK-21468
>>>>>
>>>>> I've already started to work on this ticket:
>>>>> https://issues.apache.org/jira/browse/FLINK-21393
>>>>>
>>>>>
>>>>> I'd be happy to receive your comments on these tickets
>>>>>
>>>>>
>>>>> Best
>>>>>
>>>>> Etienne Chauchot
>>>>>
>>>>>
>>>>>
>>>> --
>>>> Best, Jingsong Lee
>>>>
>

Re: [Parquet support]

Posted by Etienne Chauchot <ec...@apache.org>.
Hi all,

The fix (https://issues.apache.org/jira/browse/FLINK-21388) is now also 
available for Flink 1.12 (thanks Jingsong for merging the 
cherry-pick PR).

But before the next release from the 1.12 branch, I'd like this other PR 
to be merged: https://github.com/apache/flink/pull/15156 that introduces 
ParquetAvroInputFormat.

In this PR I just added deep field copy (in case the source schema is 
multi-level) and fixed serialization issues that I found while testing on 
Flink 1.11. It should be ready for review.
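
For readers unfamiliar with the term, a deep field copy can be sketched as follows. This is only an illustration under stated assumptions: records are modeled as nested java.util.Map and java.util.List values rather than Avro records, and the class DeepCopySketch and its deepCopy method are hypothetical names, not code from the PR. The point is that copying a multi-level record must also copy its nested records, otherwise the copy still shares them with the source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: records are modeled as nested Maps and Lists,
// not as Avro records, and this is not the code from the PR.
final class DeepCopySketch {

    // Recursively copies maps and lists; any other value (String, Integer, ...)
    // is treated as an immutable leaf and returned as-is.
    static Object deepCopy(Object value) {
        if (value instanceof Map<?, ?>) {
            Map<Object, Object> copy = new LinkedHashMap<>();
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                copy.put(entry.getKey(), deepCopy(entry.getValue()));
            }
            return copy;
        }
        if (value instanceof List<?>) {
            List<Object> copy = new ArrayList<>();
            for (Object item : (List<?>) value) {
                copy.add(deepCopy(item));
            }
            return copy;
        }
        return value;
    }

    public static void main(String[] args) {
        // A two-level record: person -> address.
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("city", "Paris");
        Map<String, Object> person = new LinkedHashMap<>();
        person.put("name", "alice");
        person.put("address", address);

        Map<?, ?> copy = (Map<?, ?>) deepCopy(person);

        // Mutating the nested record of the source must not affect the copy.
        address.put("city", "Lyon");
        System.out.println(((Map<?, ?>) copy.get("address")).get("city")); // prints "Paris"
    }
}
```

With a shallow copy (for example new LinkedHashMap<>(person)), the last line would print "Lyon" instead, because the nested address map would be shared between the source and the copy.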


Once it is merged, I'll cherry-pick it to the release-1.12 branch.


=> When is the next 1.12 release scheduled ? Do we have enough time to 
include this second parquet feature ?

Best

Etienne Chauchot


On 12/03/2021 15:31, Etienne Chauchot wrote:
> Hi Jingsong,
>
> I just submitted a cherry-pick PR 
> https://github.com/apache/flink/pull/15172 of [1] to the release-1.12 branch
>
> [1] https://github.com/apache/flink/pull/14961
>
> Etienne
>
> On 12/03/2021 14:55, Etienne Chauchot wrote:
>> Hi Jingsong,
>>
>> No problem for the delay. Thanks for merging the first parquet PR.
>>
>> I'll submit the 2 PRs to 1.12 when they're all merged to master. For 
>> that, I just have to submit a PR against this branch: 
>> https://github.com/apache/flink/tree/release-1.12 ?
>>
>> Best,
>>
>> Etienne
>>
>> On 12/03/2021 03:56, Jingsong Li wrote:
>>> Hi Etienne,
>>>
>>> Sorry for the late reply,
>>>
>>> I just merged your bug fixing.
>>> I think you can submit a PR for release-1.12.
>>>
>>> Best,
>>> Jingsong
>>>
>>> On Fri, Mar 12, 2021 at 12:22 AM Etienne Chauchot 
>>> <ec...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I forgot to mention that I submitted the new ParquetAvroInputFormat to
>>>> master (1.13) but it is made to work for 1.12.x (last release) also 
>>>> and
>>>> I'm using it with Flink 1.12.x.
>>>>
>>>> Maybe it could be a good candidate to be included in an upcoming 
>>>> 1.12.3
>>>> release, WDYT ?
>>>>
>>>> Best
>>>>
>>>> Etienne
>>>>

Re: [Parquet support]

Posted by Etienne Chauchot <ec...@apache.org>.
Hi Jingsong,

I just submitted a cherry-pick PR 
https://github.com/apache/flink/pull/15172 of [1] to the release-1.12 branch

[1] https://github.com/apache/flink/pull/14961

Etienne

Re: [Parquet support]

Posted by Etienne Chauchot <ec...@apache.org>.
Hi Jingsong,

No problem about the delay. Thanks for merging the first parquet PR.

I'll submit the 2 PRs to 1.12 when they're all merged to master. For 
that, do I just have to submit a PR against this branch: 
https://github.com/apache/flink/tree/release-1.12 ?

Best,

Etienne

Re: [Parquet support]

Posted by Jingsong Li <ji...@gmail.com>.
Hi Etienne,

Sorry for the late reply,

I just merged your bug fix.
I think you can submit a PR for release-1.12.

Best,
Jingsong

-- 
Best, Jingsong Lee

Re: [Parquet support]

Posted by Etienne Chauchot <ec...@apache.org>.
Hi,

I forgot to mention that I submitted the new ParquetAvroInputFormat to 
master (1.13), but it is also made to work with 1.12.x (the latest release), 
and I'm using it with Flink 1.12.x.

Maybe it could be a good candidate to be included in an upcoming 1.12.3 
release, WDYT ?

Best

Etienne
