Posted to dev@flink.apache.org by Etienne Chauchot <ec...@apache.org> on 2021/02/23 16:45:45 UTC

[Parquet support]

Hi all,

I've been playing with Parquet with SQL and Avro lately. I've found some 
bugs:

1. https://issues.apache.org/jira/browse/FLINK-21388 : I already 
submitted a PR on this one (https://github.com/apache/flink/pull/14961)

2. https://issues.apache.org/jira/browse/FLINK-21389

3. https://issues.apache.org/jira/browse/FLINK-21468

I've already started to work on this ticket: 
https://issues.apache.org/jira/browse/FLINK-21393


I'd be happy to receive your comments on these tickets


Best

Etienne Chauchot



Re: [Parquet support]

Posted by Etienne Chauchot <ec...@apache.org>.
Hi all,

The fix (https://issues.apache.org/jira/browse/FLINK-21388) is now also 
available for Flink 1.12 (thanks, Jingsong, for merging the cherry-pick 
PR).

But before the next release from the 1.12 branch, I'd like this other PR 
to be merged (https://github.com/apache/flink/pull/15156), which 
introduces ParquetAvroInputFormat.

In this PR I just added deep field copy (in case the source schema is 
multi-level) and fixed serialization issues that I found testing on 
flink 1.11. It should be ready for review.
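
To illustrate what "deep field copy" means for a multi-level record, here is a minimal plain-Java sketch. It is not the code from the PR: the nested maps and lists are a hypothetical stand-in for a multi-level Avro record, and the class name DeepCopySketch is made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-level record: nested maps and lists of plain values.
// Illustration only; this is not the implementation from the PR.
class DeepCopySketch {

    // Recursively copies maps and lists. Other values (String, Integer, ...) are
    // treated as immutable and shared as-is.
    static Object deepCopy(Object value) {
        if (value instanceof Map) {
            Map<Object, Object> copy = new LinkedHashMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                copy.put(e.getKey(), deepCopy(e.getValue()));
            }
            return copy;
        }
        if (value instanceof List) {
            List<Object> copy = new ArrayList<>();
            for (Object item : (List<?>) value) {
                copy.add(deepCopy(item));
            }
            return copy;
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("city", "Paris");
        Map<String, Object> person = new LinkedHashMap<>();
        person.put("name", "alice");
        person.put("address", address);

        @SuppressWarnings("unchecked")
        Map<String, Object> copy = (Map<String, Object>) deepCopy(person);

        // Mutating the nested level of the source must not leak into the copy.
        address.put("city", "Lyon");
        @SuppressWarnings("unchecked")
        Map<String, Object> copiedAddress = (Map<String, Object>) copy.get("address");
        System.out.println(copiedAddress.get("city")); // prints "Paris"
    }
}
```

A shallow copy would share the nested "address" map between source and copy, so the later change to "Lyon" would show up in both. That is the kind of problem a deep copy avoids when records are reused.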


Once it is merged, I'll cherry-pick it to the release-1.12 branch.


=> When is the next 1.12 release scheduled? Do we have enough time to 
include this second Parquet feature?

Best

Etienne Chauchot


On 12/03/2021 15:31, Etienne Chauchot wrote:
> Hi Jingsong,
>
> I just submitted a cherry-pick PR 
> https://github.com/apache/flink/pull/15172 of (1) to release-.1.12 branch
>
> [1] https://github.com/apache/flink/pull/14961
>
> Etienne
>
> On 12/03/2021 14:55, Etienne Chauchot wrote:
>> Hi Jingsong,
>>
>> No problem for the delay. Thanks for merging the first parquet PR.
>>
>> I'll submit the 2 PRs to 1.12 when they're all merged to master. For 
>> that, I just have to submit a PR against this branch: 
>> https://github.com/apache/flink/tree/release-1.12 ?
>>
>> Best,
>>
>> Etienne
>>
>> On 12/03/2021 03:56, Jingsong Li wrote:
>>> Hi Etienne,
>>>
>>> Sorry for the late reply,
>>>
>>> I just merged your bug fixing.
>>> I think you can submit a PR for release-1.12.
>>>
>>> Best,
>>> Jingsong
>>>
>>> On Fri, Mar 12, 2021 at 12:22 AM Etienne Chauchot <ec...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I forgot to mention that I submitted the new ParquetAvroInputFormat to
>>>> master (1.13) but it is made to work for 1.12.x (last release) also and
>>>> I'm using it with Flink 1.12.x.
>>>>
>>>> Maybe it could be a good candidate to be included in an upcoming 1.12.3
>>>> release, WDYT ?
>>>>
>>>> Best
>>>>
>>>> Etienne
>>>>
>>>> On 11/03/2021 17:17, Etienne Chauchot wrote:
>>>>> Hi all,
>>>>>
>>>>> I just submitted another parquet PR that adds ParquetAvroInputFormat
>>>>> (I'm using it in a benchmark I'm coding). If anyone is interested in
>>>>> reviewing it, be my guest:
>>>>>
>>>>> https://github.com/apache/flink/pull/15156
>>>>>
>>>>> I have also an older parquet PR that fixes a format conversion bug
>>>>> that is waiting for merge if anyone can review it also (already 1
>>>>> approval of a non-committer, thanks @HuangZhenQiu
>>>>> <https://github.com/HuangZhenQiu>):
>>>>>
>>>>> https://github.com/apache/flink/pull/14961
>>>>>
>>>>> If I have time, I'll also tackle the other parquet tickets that I
>>>>> opened lately
>>>>>
>>>>> Best
>>>>>
>>>>> Etienne
>>>>>
>>>>> On 25/02/2021 08:34, Jingsong Li wrote:
>>>>>> Hi Etienne,
>>>>>>
>>>>>> ParquetColumnarRowInputFormat is not fully functional yet, it has a good
>>>>>> performance, but it is hard to support complex types, like array and map...
>>>>>> So I think a migrated ParquetInputFormat version is required.
>>>>>>
>>>>>> Best,
>>>>>> Jingsong
>>>>>>
>>>>>> On Wed, Feb 24, 2021 at 3:43 PM Etienne Chauchot <ec...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thanks guys for the comments !
>>>>>>>
>>>>>>> I did not know it was legacy. I will give the new sources a try.
>>>>>>>
>>>>>>> Jingsong, when you say "migrate ParquetInputFormat to the new BulkFormat
>>>>>>> interface", do you mean that the new ParquetColumnarRowInputFormat is
>>>>>>> not fully functional yet?
>>>>>>>
>>>>>>> In the meantime, if you agree, I think I'm still gonna submit a PR for
>>>>>>> https://issues.apache.org/jira/browse/FLINK-21393 because I need it on
>>>>>>> an urgent task I'm doing.
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> Etienne
>>>>>>>
>>>>>>> On 24/02/2021 03:41, Peter Huang wrote:
>>>>>>>> Hi Jingsong,
>>>>>>>>
>>>>>>>> Thanks for pointing this out. Actually, I planned to work on changing
>>>>>>>> interfaces ParquetTableSource and ParquetInputFormat.
>>>>>>>> After refactoring the code, I may also help to fix the issue in
>>>>>>>> https://issues.apache.org/jira/browse/FLINK-21468.
>>>>>>>>
>>>>>>>> Best Regards
>>>>>>>> Peter Huang
>>>>>>>>
>>>>>>>> On Tue, Feb 23, 2021 at 6:35 PM Jingsong Li <ji...@gmail.com> wrote:
>>>>>>>>> Hi Etienne,
>>>>>>>>>
>>>>>>>>> Thanks for your reporting.
>>>>>>>>>
>>>>>>>>> There are indeed many problems. There is no doubt that we need to improve
>>>>>>>>> our current format implementation.
>>>>>>>>>
>>>>>>>>> But ParquetTableSource and ParquetInputFormat are legacy implementations
>>>>>>>>> with legacy interfaces. We have introduced new interfaces for execution and
>>>>>>>>> SQL. You can see:
>>>>>>>>> - ParquetColumnarRowInputFormat with BulkFormat interface. It is just for
>>>>>>>>> columnar row reading, not support complex types, we need
>>>>>>>>> migrate ParquetInputFormat to the new BulkFormat interface.
>>>>>>>>> - FileSystemTableSource with DynamicTableSource interface, It is a generic
>>>>>>>>> FileSystem source for all formats, we can just use it for parquet too.
>>>>>>>>>
>>>>>>>>> Considering ParquetTableSource and ParquetInputFormat are legacy
>>>>>>>>> interfaces, I think we can finish migration work first, what do you think?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jingsong
>>>>>>>>>
>>>>>>>>> On Wed, Feb 24, 2021 at 12:46 AM Etienne Chauchot <echauchot@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I've been playing with Parquet with SQL and Avro lately. I've found some
>>>>>>>>>> bugs:
>>>>>>>>>>
>>>>>>>>>> 1. https://issues.apache.org/jira/browse/FLINK-21388 : I already
>>>>>>>>>> submitted a PR on this one (https://github.com/apache/flink/pull/14961)
>>>>>>>>>>
>>>>>>>>>> 2. https://issues.apache.org/jira/browse/FLINK-21389
>>>>>>>>>>
>>>>>>>>>> 3. https://issues.apache.org/jira/browse/FLINK-21468
>>>>>>>>>>
>>>>>>>>>> I've already started to work on this ticket:
>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-21393
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'd be happy to receive your comments on these tickets
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best
>>>>>>>>>>
>>>>>>>>>> Etienne Chauchot
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> Best, Jingsong Lee
>>>>>>>>>
>>>

Re: [Parquet support]

Posted by Etienne Chauchot <ec...@apache.org>.
Hi Jingsong,

I just submitted a cherry-pick PR 
https://github.com/apache/flink/pull/15172 of [1] to the release-1.12 branch.

[1] https://github.com/apache/flink/pull/14961

Etienne


Re: [Parquet support]

Posted by Etienne Chauchot <ec...@apache.org>.
Hi Jingsong,

No problem about the delay. Thanks for merging the first Parquet PR.

I'll submit the 2 PRs to 1.12 once they're both merged to master. For 
that, do I just have to submit a PR against this branch: 
https://github.com/apache/flink/tree/release-1.12 ?

Best,

Etienne


Re: [Parquet support]

Posted by Jingsong Li <ji...@gmail.com>.
Hi Etienne,

Sorry for the late reply.

I just merged your bug fix.
I think you can submit a PR for release-1.12.

Best,
Jingsong



-- 
Best, Jingsong Lee

Re: [Parquet support]

Posted by Etienne Chauchot <ec...@apache.org>.
Hi,

I forgot to mention that I submitted the new ParquetAvroInputFormat to 
master (1.13), but it is also built to work with 1.12.x (the latest 
release), and I'm using it with Flink 1.12.x.

Maybe it could be a good candidate for inclusion in an upcoming 1.12.3 
release, WDYT?

Best

Etienne


Re: [Parquet support]

Posted by Etienne Chauchot <ec...@apache.org>.
Hi all,

I just submitted another parquet PR that adds ParquetAvroInputFormat 
(I'm using it in a benchmark I'm coding). If anyone is interested in 
reviewing it, be my guest:

https://github.com/apache/flink/pull/15156

I also have an older Parquet PR, fixing a format conversion bug, that is 
waiting to be merged, if anyone can review it as well (it already has one 
approval from a non-committer, thanks @HuangZhenQiu <https://github.com/HuangZhenQiu>):

https://github.com/apache/flink/pull/14961

If I have time, I'll also tackle the other parquet tickets that I opened 
lately

Best

Etienne


Re: [Parquet support]

Posted by Etienne Chauchot <ec...@apache.org>.
Hi all,

Jingsong, thanks, that makes sense.

Besides, sorry but I found another bug in ParquetInputFormat:

https://issues.apache.org/jira/browse/FLINK-21520

For my urgent needs, I'll work around it by filtering the DataSet after 
reading, rather than applying the filter in the ParquetInputFormat at 
source reading time.
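
A minimal plain-Java sketch of that workaround, for illustration only. It uses no Flink classes: Row, readAll and readThenFilter are hypothetical names, and readAll stands in for a source that returns every record with no predicate pushed down.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Plain-Java illustration only: no Flink classes are used, and Row is a hypothetical record type.
class PostReadFilterSketch {

    static final class Row {
        final String name;
        final int age;

        Row(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Stand-in for the source: returns every record, with no predicate pushed down.
    static List<Row> readAll() {
        return Arrays.asList(new Row("a", 17), new Row("b", 42), new Row("c", 30));
    }

    // The workaround: apply the predicate after the read instead of inside the source.
    static List<Row> readThenFilter(Predicate<Row> predicate) {
        return readAll().stream().filter(predicate).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> adults = readThenFilter(row -> row.age >= 18);
        System.out.println(adults.size()); // prints 2
    }
}
```

With the DataSet API, the equivalent would be a filter transformation applied to the DataSet produced by the input format. The trade-off is that every record is read and deserialized before being discarded, so this costs more I/O than a filter applied at read time.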

Etienne Chauchot


Re: [Parquet support]

Posted by Jingsong Li <ji...@gmail.com>.
Hi Etienne,

ParquetColumnarRowInputFormat is not fully functional yet: it performs
well, but it is hard to make it support complex types such as array and
map. So I think a migrated version of ParquetInputFormat is required.

Best,
Jingsong

On Wed, Feb 24, 2021 at 3:43 PM Etienne Chauchot <ec...@apache.org>
wrote:

> Hi,
>
> Thanks guys for the comments !
>
> I did not know it was legacy. I will give the new sources a try.
>
> Jingsong, when you say "migrate ParquetInputFormat to the new BulkFormat
> interface", do you mean that the new ParquetColumnarRowInputFormat is
> not fully functional yet?
>
> In the meantime, if you agree, I think I'm still gonna submit a PR for
> https://issues.apache.org/jira/browse/FLINK-21393 because I need it for
> an urgent task I'm working on.
>
> Best
>
> Etienne
>


-- 
Best, Jingsong Lee

Re: [Parquet support]

Posted by Etienne Chauchot <ec...@apache.org>.
Hi,

Thanks guys for the comments !

I did not know it was legacy. I will give the new sources a try.

Jingsong, when you say "migrate ParquetInputFormat to the new BulkFormat 
interface", do you mean that the new ParquetColumnarRowInputFormat is 
not fully functional yet?

In the meantime, if you agree, I think I'm still gonna submit a PR for
https://issues.apache.org/jira/browse/FLINK-21393 because I need it for
an urgent task I'm working on.

Best

Etienne

On 24/02/2021 03:41, Peter Huang wrote:
> Hi Jingsong,
>
> Thanks for pointing this out. Actually, I had planned to work on changing
> the ParquetTableSource and ParquetInputFormat interfaces.
> After refactoring the code, I may also be able to help fix the issue in
> https://issues.apache.org/jira/browse/FLINK-21468.
>
> Best Regards
> Peter Huang
>

Re: [Parquet support]

Posted by Peter Huang <hu...@gmail.com>.
Hi Jingsong,

Thanks for pointing this out. Actually, I had planned to work on changing
the ParquetTableSource and ParquetInputFormat interfaces.
After refactoring the code, I may also be able to help fix the issue in
https://issues.apache.org/jira/browse/FLINK-21468.

Best Regards
Peter Huang

On Tue, Feb 23, 2021 at 6:35 PM Jingsong Li <ji...@gmail.com> wrote:

> Hi Etienne,
>
> Thanks for reporting these.
>
> There are indeed many problems. There is no doubt that we need to improve
> our current format implementation.
>
> But ParquetTableSource and ParquetInputFormat are legacy implementations
> with legacy interfaces. We have introduced new interfaces for execution and
> SQL. You can see:
> - ParquetColumnarRowInputFormat with the BulkFormat interface. It is only for
> columnar row reading and does not support complex types; we need to
> migrate ParquetInputFormat to the new BulkFormat interface.
> - FileSystemTableSource with the DynamicTableSource interface. It is a generic
> FileSystem source for all formats, so we can use it for Parquet too.
>
> Considering that ParquetTableSource and ParquetInputFormat are legacy
> interfaces, I think we can finish the migration work first. What do you think?
>
> Best,
> Jingsong
>
> --
> Best, Jingsong Lee
>

Re: [Parquet support]

Posted by Etienne Chauchot <ec...@apache.org>.
Thanks Arvid !

Etienne

On 21/05/2021 12:45, Arvid Heise wrote:
> Hi Etienne,
>
> I'm taking over and just left you a review. Sorry for the long delays.
>
> Best,
>
> Arvid
>

Re: [Parquet support]

Posted by Arvid Heise <ar...@apache.org>.
Hi Etienne,

I'm taking over and just left you a review. Sorry for the long delays.

Best,

Arvid

On Fri, May 21, 2021 at 11:25 AM Etienne Chauchot <ec...@apache.org>
wrote:

> Hi all,
>
> Considering (see my email below) that the DataStream API is not yet fully
> functional in batch mode, and that the new sources are only available on
> the DataStream API, can we merge this PR [1] about existing sources in the
> DataSet API? It has already received one LGTM.
>
> [1] https://github.com/apache/flink/pull/15156
>
> anyone ?
>
> Etienne
>

Re: [Parquet support]

Posted by Etienne Chauchot <ec...@apache.org>.
Hi all,

Considering (see my email below) that the DataStream API is not yet fully
functional in batch mode, and that the new sources are only available on
the DataStream API, can we merge this PR [1] about existing sources in the
DataSet API? It has already received one LGTM.

[1] https://github.com/apache/flink/pull/15156

anyone ?

Etienne

On 06/05/2021 14:23, Etienne Chauchot wrote:
> Hi,
>
> @Jingsong, I agree that adding a new feature (ParquetAvroInputFormat)
> to an old source API is a maintenance burden. But IMHO, while the new
> DataStream batch/streaming convergent API is not 100% functional, we
> still need to maintain older sources and add missing features to them.
>
> Indeed, I realized that the DataStream API in batch mode [1] does not
> support aggregations yet [2], so in such a case a user would stick to
> the DataSet API. And the new FileSource API with
> ParquetColumnarRowInputFormat is not available in the DataSet API [3].
>
> So, long story short, in some cases a user will have no other choice
> but to use ParquetInputFormat and the legacy source.
>
> WDYT ?
>
> [1] https://issues.apache.org/jira/browse/FLINK-19316
>
> [2] https://issues.apache.org/jira/browse/FLINK-22587
>
> [3] 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-Compatibility,Deprecation,andMigrationPlan
>
> Best,
>
> Etienne
>

Re: [Parquet support]

Posted by Etienne Chauchot <ec...@apache.org>.
Hi,

@Jingsong, I agree that adding a new feature (ParquetAvroInputFormat) to
an old source API is a maintenance burden. But IMHO, while the new
DataStream batch/streaming convergent API is not 100% functional, we
still need to maintain older sources and add missing features to them.

Indeed, I realized that the DataStream API in batch mode [1] does not
support aggregations yet [2], so in such a case a user would stick to the
DataSet API. And the new FileSource API with
ParquetColumnarRowInputFormat is not available in the DataSet API [3].

So, long story short, in some cases a user will have no other choice
but to use ParquetInputFormat and the legacy source.

WDYT ?

[1] https://issues.apache.org/jira/browse/FLINK-19316

[2] https://issues.apache.org/jira/browse/FLINK-22587

[3] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-Compatibility,Deprecation,andMigrationPlan

Best,

Etienne

On 24/02/2021 03:35, Jingsong Li wrote:
> Hi Etienne,
>
> Thanks for reporting these.
>
> There are indeed many problems. There is no doubt that we need to improve
> our current format implementation.
>
> But ParquetTableSource and ParquetInputFormat are legacy implementations
> with legacy interfaces. We have introduced new interfaces for execution and
> SQL. You can see:
> - ParquetColumnarRowInputFormat with the BulkFormat interface. It is only for
> columnar row reading and does not support complex types; we need to
> migrate ParquetInputFormat to the new BulkFormat interface.
> - FileSystemTableSource with the DynamicTableSource interface. It is a generic
> FileSystem source for all formats, so we can use it for Parquet too.
>
> Considering that ParquetTableSource and ParquetInputFormat are legacy
> interfaces, I think we can finish the migration work first. What do you think?
>
> Best,
> Jingsong
>

Re: [Parquet support]

Posted by Jingsong Li <ji...@gmail.com>.
Hi Etienne,

Thanks for reporting these.

There are indeed many problems. There is no doubt that we need to improve
our current format implementation.

But ParquetTableSource and ParquetInputFormat are legacy implementations
with legacy interfaces. We have introduced new interfaces for execution and
SQL. You can see:
- ParquetColumnarRowInputFormat with the BulkFormat interface. It is only for
columnar row reading and does not support complex types; we need to
migrate ParquetInputFormat to the new BulkFormat interface.
- FileSystemTableSource with the DynamicTableSource interface. It is a generic
FileSystem source for all formats, so we can use it for Parquet too.

Considering that ParquetTableSource and ParquetInputFormat are legacy
interfaces, I think we can finish the migration work first. What do you think?

Best,
Jingsong

On Wed, Feb 24, 2021 at 12:46 AM Etienne Chauchot <ec...@apache.org>
wrote:

> Hi all,
>
> I've been playing with Parquet with SQL and Avro lately. I've found some
> bugs:
>
> 1. https://issues.apache.org/jira/browse/FLINK-21388 : I already
> submitted a PR on this one (https://github.com/apache/flink/pull/14961)
>
> 2. https://issues.apache.org/jira/browse/FLINK-21389
>
> 3. https://issues.apache.org/jira/browse/FLINK-21468
>
> I've already started to work on this ticket:
> https://issues.apache.org/jira/browse/FLINK-21393
>
>
> I'd be happy to receive your comments on these tickets
>
>
> Best
>
> Etienne Chauchot
>
>
>

-- 
Best, Jingsong Lee