Posted to user@flink.apache.org by Sourigna Phetsarath <gn...@teamaol.com> on 2016/03/21 00:11:11 UTC

Flink 1.0.0 reading files from multiple directory with wildcards

All,

Do any of the Flink Data Sources support comma separated directories with
wildcards?

For example:

env.readFile("/data/2016/01/01/*/*,/data/2016/01/02/*/*,/data/2016/01/03/*/*")


Thanks in advance for any help that you can provide.
-- 


*Gna Phetsarath*System Architect // AOL Platforms // Data Services //
Applied Research Chapter
770 Broadway, 5th Floor, New York, NY 10003
o: 212.402.4871 // m: 917.373.7363
vvmr: 8890237 aim: sphetsarath20 t: @sourigna

* <http://www.aolplatforms.com>*

Re: Flink 1.0.0 reading files from multiple directory with wildcards

Posted by Sourigna Phetsarath <gn...@teamaol.com>.
Great!  I will, once I clear it with the legal team here.

On Wed, Mar 23, 2016 at 6:19 AM, Ufuk Celebi <uc...@apache.org> wrote:

> Nice! Would you like to contribute this to Flink via a pull request? Some
> resources about the contribution process can be found here:
>
> http://flink.apache.org/contribute-code.html
> http://flink.apache.org/how-to-contribute.html

Re: Flink 1.0.0 reading files from multiple directory with wildcards

Posted by Ufuk Celebi <uc...@apache.org>.
Nice! Would you like to contribute this to Flink via a pull request? Some
resources about the contribution process can be found here:

http://flink.apache.org/contribute-code.html
http://flink.apache.org/how-to-contribute.html

On Wed, Mar 23, 2016 at 12:00 AM, Fabian Hueske <fh...@gmail.com> wrote:

> Hi Gna,
>
> thanks for sharing the good news and opening the JIRA!
>
> Cheers, Fabian

Re: Flink 1.0.0 reading files from multiple directory with wildcards

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Gna,

thanks for sharing the good news and opening the JIRA!

Cheers, Fabian

2016-03-22 23:30 GMT+01:00 Sourigna Phetsarath <gn...@teamaol.com>:

> Ufuk & Fabian,
>
> FYI, I was able to extend the FileInputFormat and override createInputSplits
> to handle multiple Paths. This reduced resource usage and improved the
> performance of the job.
>
> Also added this ticket: https://issues.apache.org/jira/browse/FLINK-3655
>
> -Gna

Re: Flink 1.0.0 reading files from multiple directory with wildcards

Posted by Sourigna Phetsarath <gn...@teamaol.com>.
Ufuk & Fabian,

FYI, I was able to extend the FileInputFormat and override createInputSplits
to handle multiple Paths. This reduced resource usage and improved the
performance of the job.

Also added this ticket: https://issues.apache.org/jira/browse/FLINK-3655

-Gna

On Mon, Mar 21, 2016 at 10:04 AM, Sourigna Phetsarath <
gna.phetsarath@teamaol.com> wrote:

> Fabian,
>
> I'll try extending InputFormat as you suggested and will create a JIRA
> issue as well.
>
> I also have an AvroGenericRecordInput format class that I would like to
> contribute once I have time to clean it up and get it into your code base.
>
> -Gna

Re: Flink 1.0.0 reading files from multiple directory with wildcards

Posted by Sourigna Phetsarath <gn...@teamaol.com>.
Fabian,

I'll try extending InputFormat as you suggested and will create a JIRA
issue as well.

I also have an AvroGenericRecordInput format class that I would like to
contribute once I have time to clean it up and get it into your code base.

-Gna

On Mon, Mar 21, 2016 at 6:35 AM, Fabian Hueske <fh...@gmail.com> wrote:

> Hi,
>
> no, this is currently not supported. However, I agree this would be a very
> valuable addition to the FileInputFormat.
> Would you mind opening a JIRA issue with your suggestions?
>
> Until this is added to Flink, it can be implemented as a custom
> InputFormat based on FileInputFormat by overriding the createInputSplits()
> method.
>
> Best, Fabian

Re: Flink 1.0.0 reading files from multiple directory with wildcards

Posted by Sourigna Phetsarath <gn...@teamaol.com>.
Thanks Ufuk, I'm already using the recursive traversal feature.

On Mon, Mar 21, 2016 at 8:39 AM, Ufuk Celebi <uc...@apache.org> wrote:

> If you want all sub directories under data/2016/01, then this could help:
> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#recursive-traversal-of-the-input-path-directory

Re: Flink 1.0.0 reading files from multiple directory with wildcards

Posted by Ufuk Celebi <uc...@apache.org>.
If you want all sub directories under data/2016/01, then this could help:
https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#recursive-traversal-of-the-input-path-directory

On Mon, Mar 21, 2016 at 11:35 AM, Fabian Hueske <fh...@gmail.com> wrote:

> Hi,
>
> no, this is currently not supported. However, I agree this would be a very
> valuable addition to the FileInputFormat.
> Would you mind opening a JIRA issue with your suggestions?
>
> Until this is added to Flink, it can be implemented as a custom
> InputFormat based on FileInputFormat by overriding the createInputSplits()
> method.
>
> Best, Fabian

Re: Flink 1.0.0 reading files from multiple directory with wildcards

Posted by Fabian Hueske <fh...@gmail.com>.
Hi,

no, this is currently not supported. However, I agree this would be a very
valuable addition to the FileInputFormat.
Would you mind opening a JIRA issue with your suggestions?

Until this is added to Flink, it can be implemented as a custom InputFormat
based on FileInputFormat by overriding the createInputSplits() method.

Best, Fabian
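
As a first step, the workaround Fabian describes needs code that turns a
comma-separated list of wildcard paths into a concrete set of files. The
sketch below shows only that step in plain Java, with no Flink dependency.
MultiPathSpec, splitSpec and expand are hypothetical names, and the patterns
are assumed to be relative to a single root directory. A real implementation
would presumably do the equivalent inside an overridden createInputSplits()
using Flink's own file system classes, which this sketch does not attempt.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration only; not part of Flink. */
final class MultiPathSpec {

    /** Splits a spec such as "a/*, b/*" into trimmed, non-empty glob patterns. */
    static List<String> splitSpec(String spec) {
        List<String> patterns = new ArrayList<>();
        for (String part : spec.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(trimmed);
            }
        }
        return patterns;
    }

    /**
     * Returns, in sorted order, every regular file under root whose path
     * relative to root matches at least one glob pattern in the spec.
     * In java.nio globs, "*" does not cross directory boundaries.
     */
    static List<Path> expand(Path root, String spec) throws IOException {
        List<PathMatcher> matchers = new ArrayList<>();
        for (String pattern : splitSpec(spec)) {
            matchers.add(FileSystems.getDefault().getPathMatcher("glob:" + pattern));
        }
        try (Stream<Path> files = Files.walk(root)) {
            return files
                .filter(Files::isRegularFile)
                .filter(p -> matchers.stream()
                    .anyMatch(m -> m.matches(root.relativize(p))))
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

With a root of /data/2016/01, a call such as
MultiPathSpec.expand(root, "01/*/*,02/*/*,03/*/*") would list the files two
levels below each of the three day directories. Trimming each pattern also
tolerates the space that mail clients insert when the spec wraps after a comma.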

2016-03-21 0:11 GMT+01:00 Sourigna Phetsarath <gn...@teamaol.com>:

> All,
>
> Do any of the Flink Data Sources support comma separated directories with
> wildcards?
>
> For example:
>
> env.readFile("/data/2016/01/01/*/*,/data/2016/01/02/*/*,/data/2016/01/03/*/*")
>
>
> Thanks in advance for any help that you can provide.