Posted to user@flink.apache.org by Camelia-Elena Ciolac <ca...@inria.fr> on 2014/10/24 12:08:24 UTC
Collection of files as input
Hello,
I am working on a use case where we have a collection of files as input.
I am using the env.createInput based on AvroInputFormat. For one input file, it is fine to specify it in new Path(args[0]).
But is it possible (and if so, how) to create a DataSet directly from a collection of files?
I thought of a workaround: build one DataSet, dsUnion, to hold the union result,
and a second DataSet, dsCurrent, that reads a single file at a time.
read first file in dsUnion
in a loop, repeat:
read another file in dsCurrent
dsUnion = dsUnion.union(dsCurrent)
until all files in the collection are processed.
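The loop above might be sketched in Java roughly as follows. This is a minimal sketch, not a tested program: MyRecord is a hypothetical Avro record class, the file paths are assumed to arrive as program arguments, and the AvroInputFormat import path may differ across Flink versions.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.AvroInputFormat;
import org.apache.flink.core.fs.Path;

public class UnionFiles {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Assumption: each program argument is the path of one Avro file.
        List<String> files = Arrays.asList(args);

        // Read the first file into the running union.
        DataSet<MyRecord> dsUnion =
            env.createInput(new AvroInputFormat<>(new Path(files.get(0)), MyRecord.class));

        // Union in the remaining files one at a time.
        for (int i = 1; i < files.size(); i++) {
            DataSet<MyRecord> dsCurrent =
                env.createInput(new AvroInputFormat<>(new Path(files.get(i)), MyRecord.class));
            dsUnion = dsUnion.union(dsCurrent);
        }

        dsUnion.print();
    }
}
```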
Is there a simpler solution with the Flink API?
Thanks in advance!
Camelia
Re: Collection of files as input
Posted by Fabian Hueske <fh...@apache.org>.
Hi Camelia,
FileInputFormats such as the AvroInputFormat can also read all files in a
directory if this is specified as the path.
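For example, pointing the input format at a directory might look like the sketch below. The directory path and the MyRecord class are placeholders for illustration, and the import path of AvroInputFormat may vary between Flink versions.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.AvroInputFormat;
import org.apache.flink.core.fs.Path;

public class ReadDirectory {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Give the input format a directory instead of a single file;
        // every file inside that directory is then read as input.
        // "hdfs:///data/avro/" and MyRecord are hypothetical.
        AvroInputFormat<MyRecord> format =
            new AvroInputFormat<>(new Path("hdfs:///data/avro/"), MyRecord.class);

        DataSet<MyRecord> records = env.createInput(format);
        records.print();
    }
}
```

Note that, as far as I know, file input formats do not descend into nested subdirectories by default, so this applies to the files directly inside the given directory.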
Hope that helps.
Best, Fabian