Posted to dev@flink.apache.org by Etienne Chauchot <ec...@apache.org> on 2021/03/25 12:12:42 UTC

Glob support on file access

Hi all,

In case it is useful to some of you:

I have a big batch that needs to use globs (*.parquet for example) to 
read input files. It seems that globs do not work out of the box (see 
https://issues.apache.org/jira/browse/FLINK-6417)

But there is a workaround:


/* FileInputFormat is abstract, so instantiate any concrete subclass; TextInputFormat is only an example. */
/* extractDir(filePath) is a user-provided helper that extracts the parent dir. */
final FileInputFormat<String> inputFormat =
    new TextInputFormat(new Path(extractDir(filePath)));
/* filePath contains the glob; the whole path needs to be provided to GlobFilePathFilter. */
inputFormat.setFilesFilter(
    new GlobFilePathFilter(Collections.singletonList(filePath), Collections.emptyList()));
inputFormat.setNestedFileEnumeration(true);
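
The workaround relies on an extractDir helper that the message does not show. A minimal sketch of what such a helper could look like, using only the JDK, is given here. GlobPaths, its extractDir and its matches method are hypothetical names for illustration, not Flink API. The matches method uses the java.nio "glob:" syntax on the default file system and assumes Unix-style paths; that GlobFilePathFilter interprets patterns the same way is an assumption worth checking against the Flink version in use.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

/** Hypothetical helper (not part of Flink) for splitting a path that contains a glob. */
final class GlobPaths {

    /** Returns the longest leading directory of globPath that contains no glob character. */
    static String extractDir(String globPath) {
        int firstGlob = globPath.length(); // no glob found: use the parent of the last segment
        for (int i = 0; i < globPath.length(); i++) {
            if ("*?[{".indexOf(globPath.charAt(i)) >= 0) {
                firstGlob = i;
                break;
            }
        }
        int lastSlash = globPath.lastIndexOf('/', firstGlob);
        if (lastSlash < 0) {
            return "."; // relative pattern such as "*.parquet"
        }
        return lastSlash == 0 ? "/" : globPath.substring(0, lastSlash);
    }

    /** Checks a candidate path against the full glob path with the JDK glob matcher. */
    static boolean matches(String globPath, String candidate) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + globPath);
        return matcher.matches(Paths.get(candidate));
    }

    public static void main(String[] args) {
        String filePath = "/data/in/*.parquet";
        System.out.println(extractDir(filePath));
        System.out.println(matches(filePath, "/data/in/a.parquet"));
    }
}
```

With this sketch, the directory handed to the input format is everything before the first path segment that contains a glob character, while the complete pattern still goes to the filter.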

Hope it helps some people.

Etienne Chauchot



Re: Glob support on file access

Posted by Etienne Chauchot <ec...@apache.org>.
Hi Arvid,

Thanks for your answer. Yes, I know that the DataSet API is to be
deprecated. Still, it is a small PR (2 lines of production code), so I
guess we could merge it for users who still use the DataSet API for batch.

@Timo, can I assign you as reviewer on the PR?

Best

Etienne

On 29/03/2021 20:05, Arvid Heise wrote:
> Hi Etienne,
>
> In general, any small PR on this subject is very welcome. I don't think
> that the community as a whole will invest much into FileInputFormat as the
> whole DataSet API is phasing out.
>
> Afaik SQL and Table API are only using InputFormat for the legacy
> compatibility layer (e.g. when it comes to translating into DataSet). All
> the new batchy stuff is based on BulkFormat and unified source/sink
> interface. I'm CC'ing Timo who can correct me if I'm wrong.
>
> So if you just want to add glob support on FileInputFormat /only/ for SQL
> and Table API, I don't think it's worth the effort. It would be more
> interesting to see if the new FileSource does support it properly and
> rather add it there.
>
> On Mon, Mar 29, 2021 at 4:57 PM Etienne Chauchot <ec...@apache.org>
> wrote:
>
>> But still this workaround would only work when you have access to the
>> underlying /FileInputFormat/. For SQL and Table APIs, you don't so
>> you'll be unable to apply this workaround. So what we could do is make a
>> PR to support glob at the FileInputFormat level to profit for all APIs.
>>
>> I'm gonna do it if everyone agrees.
>>
>> Best
>>
>> Etienne Chauchot
>>
>> On 25/03/2021 13:12, Etienne Chauchot wrote:
>>> Hi all,
>>>
>>> In case it is useful to some of you:
>>>
>>> I have a big batch that needs to use globs (*.parquet for example) to
>>> read input files. It seems that globs do not work out of the box (see
>>> https://issues.apache.org/jira/browse/FLINK-6417)
>>>
>>> But there is a workaround:
>>>
>>>
>>> /* or any subclass of FileInputFormat; extractDir extracts the parent dir */
>>> final FileInputFormat inputFormat =
>>>     new FileInputFormat(new Path(extractDir(filePath)));
>>> /* filePath contains glob, the whole path needs to be provided to GlobFilePathFilter */
>>> inputFormat.setFilesFilter(
>>>     new GlobFilePathFilter(Collections.singletonList(filePath), Collections.emptyList()));
>>> inputFormat.setNestedFileEnumeration(true);
>>>
>>> Hope, it helps some people
>>>
>>> Etienne Chauchot
>>>
>>>

Re: Glob support on file access

Posted by Etienne Chauchot <ec...@apache.org>.
Hi,

FYI: I just submitted the small PR to support this: 
https://github.com/apache/flink/pull/15436

Best

Etienne

On 29/03/2021 20:05, Arvid Heise wrote:
> Hi Etienne,
>
> In general, any small PR on this subject is very welcome. I don't think
> that the community as a whole will invest much into FileInputFormat as the
> whole DataSet API is phasing out.
>
> Afaik SQL and Table API are only using InputFormat for the legacy
> compatibility layer (e.g. when it comes to translating into DataSet). All
> the new batchy stuff is based on BulkFormat and unified source/sink
> interface. I'm CC'ing Timo who can correct me if I'm wrong.
>
> So if you just want to add glob support on FileInputFormat /only/ for SQL
> and Table API, I don't think it's worth the effort. It would be more
> interesting to see if the new FileSource does support it properly and
> rather add it there.
>
> On Mon, Mar 29, 2021 at 4:57 PM Etienne Chauchot <ec...@apache.org>
> wrote:
>
>> But still this workaround would only work when you have access to the
>> underlying /FileInputFormat/. For SQL and Table APIs, you don't so
>> you'll be unable to apply this workaround. So what we could do is make a
>> PR to support glob at the FileInputFormat level to profit for all APIs.
>>
>> I'm gonna do it if everyone agrees.
>>
>> Best
>>
>> Etienne Chauchot
>>
>> On 25/03/2021 13:12, Etienne Chauchot wrote:
>>> Hi all,
>>>
>>> In case it is useful to some of you:
>>>
>>> I have a big batch that needs to use globs (*.parquet for example) to
>>> read input files. It seems that globs do not work out of the box (see
>>> https://issues.apache.org/jira/browse/FLINK-6417)
>>>
>>> But there is a workaround:
>>>
>>>
>>> /* or any subclass of FileInputFormat; extractDir extracts the parent dir */
>>> final FileInputFormat inputFormat =
>>>     new FileInputFormat(new Path(extractDir(filePath)));
>>> /* filePath contains glob, the whole path needs to be provided to GlobFilePathFilter */
>>> inputFormat.setFilesFilter(
>>>     new GlobFilePathFilter(Collections.singletonList(filePath), Collections.emptyList()));
>>> inputFormat.setNestedFileEnumeration(true);
>>>
>>> Hope, it helps some people
>>>
>>> Etienne Chauchot
>>>
>>>

Re: Glob support on file access

Posted by Arvid Heise <ar...@apache.org>.
Hi Etienne,

In general, any small PR on this subject is very welcome. I don't think
that the community as a whole will invest much into FileInputFormat as the
whole DataSet API is phasing out.

Afaik SQL and Table API are only using InputFormat for the legacy
compatibility layer (e.g. when it comes to translating into DataSet). All
the new batchy stuff is based on BulkFormat and unified source/sink
interface. I'm CC'ing Timo who can correct me if I'm wrong.

So if you just want to add glob support on FileInputFormat /only/ for SQL
and Table API, I don't think it's worth the effort. It would be more
interesting to see if the new FileSource does support it properly and
rather add it there.

On Mon, Mar 29, 2021 at 4:57 PM Etienne Chauchot <ec...@apache.org>
wrote:

> But still this workaround would only work when you have access to the
> underlying /FileInputFormat/. For SQL and Table APIs, you don't so
> you'll be unable to apply this workaround. So what we could do is make a
> PR to support glob at the FileInputFormat level to profit for all APIs.
>
> I'm gonna do it if everyone agrees.
>
> Best
>
> Etienne Chauchot
>
> On 25/03/2021 13:12, Etienne Chauchot wrote:
> >
> > Hi all,
> >
> > In case it is useful to some of you:
> >
> > I have a big batch that needs to use globs (*.parquet for example) to
> > read input files. It seems that globs do not work out of the box (see
> > https://issues.apache.org/jira/browse/FLINK-6417)
> >
> > But there is a workaround:
> >
> >
> > /* or any subclass of FileInputFormat; extractDir extracts the parent dir */
> > final FileInputFormat inputFormat =
> >     new FileInputFormat(new Path(extractDir(filePath)));
> > /* filePath contains glob, the whole path needs to be provided to GlobFilePathFilter */
> > inputFormat.setFilesFilter(
> >     new GlobFilePathFilter(Collections.singletonList(filePath), Collections.emptyList()));
> > inputFormat.setNestedFileEnumeration(true);
> >
> > Hope, it helps some people
> >
> > Etienne Chauchot
> >
> >
>

Re: Glob support on file access

Posted by Etienne Chauchot <ec...@apache.org>.
Still, this workaround only works when you have access to the
underlying /FileInputFormat/. With the SQL and Table APIs you don't, so
you cannot apply it there. So what we could do is make a PR to support
globs at the FileInputFormat level, so that all APIs benefit.
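
To make concrete what handling globs at the file enumeration level amounts to, here is a rough sketch using only java.nio, not Flink code and not necessarily what the PR does. GlobEnumeration and listMatching are hypothetical names. The sketch walks a base directory recursively (the counterpart of nested file enumeration) and keeps only the regular files whose path, relative to that base directory, matches the glob.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical illustration (not Flink code) of glob filtering during file enumeration. */
final class GlobEnumeration {

    /** Lists regular files under baseDir whose path relative to baseDir matches the glob. */
    static List<Path> listMatching(Path baseDir, String glob) throws IOException {
        PathMatcher matcher = baseDir.getFileSystem().getPathMatcher("glob:" + glob);
        try (Stream<Path> files = Files.walk(baseDir)) { // recursive walk, closed after use
            return files
                    .filter(Files::isRegularFile)
                    .filter(p -> matcher.matches(baseDir.relativize(p)))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java GlobEnumeration <baseDir> <glob>, for example: /data/in "**.parquet"
        for (Path p : listMatching(Path.of(args[0]), args[1])) {
            System.out.println(p);
        }
    }
}
```

In the java.nio glob syntax, a single * does not cross directory boundaries while ** does, so "*.parquet" only matches files directly in the base directory and "**.parquet" also matches files in nested directories.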

I'm gonna do it if everyone agrees.

Best

Etienne Chauchot

On 25/03/2021 13:12, Etienne Chauchot wrote:
>
> Hi all,
>
> In case it is useful to some of you:
>
> I have a big batch that needs to use globs (*.parquet for example) to 
> read input files. It seems that globs do not work out of the box (see 
> https://issues.apache.org/jira/browse/FLINK-6417)
>
> But there is a workaround:
>
>
> /* or any subclass of FileInputFormat; extractDir extracts the parent dir */
> final FileInputFormat inputFormat =
>     new FileInputFormat(new Path(extractDir(filePath)));
> /* filePath contains glob, the whole path needs to be provided to GlobFilePathFilter */
> inputFormat.setFilesFilter(
>     new GlobFilePathFilter(Collections.singletonList(filePath), Collections.emptyList()));
> inputFormat.setNestedFileEnumeration(true);
>
> Hope, it helps some people
>
> Etienne Chauchot
>
>
