Posted to user@flink.apache.org by Fabian Hueske <fh...@gmail.com> on 2017/09/04 07:58:30 UTC

Re: Distributed reading and parsing of protobuf files from S3 in Apache Flink

Hi,

readFile() expects a FileInputFormat, i.e., your custom InputFormat would
need to extend FileInputFormat.
In general, any InputFormat decides what to read when generating
InputSplits. In your case, the createInputSplits() method should return one
InputSplit for each file it wants to read.
By default, FileInputFormat creates one or more input splits for each file
in a directory. If you only want to read a subset of files (or have a list
of files to read), you should override the method and return exactly one
input split for each file to read (because your files cannot be read in
parallel).
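
To make that concrete, here is a rough, untested sketch of such an
InputFormat. The class name, the byte[] record type and the list-based
constructor are only illustrations (they are not part of Flink's API);
the actual protobuf parsing would go where the nextRecord() stub reads
the raw bytes:

import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.FileStatus;
import org.apache.flink.core.fs.Path;

import java.io.EOFException;
import java.io.IOException;
import java.util.List;

public class FileListInputFormat extends FileInputFormat<byte[]> {

    private final List<String> filePaths;  // explicit list of files to read
    private boolean end;

    public FileListInputFormat(List<String> filePaths) {
        this.filePaths = filePaths;
        // the base class expects some path to be configured; the actual
        // splits are built from the explicit list in createInputSplits()
        setFilePath(filePaths.get(0));
        this.unsplittable = true;          // each file is read as a whole
    }

    @Override
    public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException {
        FileInputSplit[] splits = new FileInputSplit[filePaths.size()];
        for (int i = 0; i < filePaths.size(); i++) {
            Path file = new Path(filePaths.get(i));
            FileStatus status = file.getFileSystem().getFileStatus(file);
            // exactly one split per file, covering the whole file;
            // no host preference makes sense for S3, hence null
            splits[i] = new FileInputSplit(i, file, 0L, status.getLen(), null);
        }
        return splits;
    }

    @Override
    public void open(FileInputSplit split) throws IOException {
        super.open(split);   // opens this.stream for the split's file
        end = false;
    }

    @Override
    public boolean reachedEnd() {
        return end;
    }

    @Override
    public byte[] nextRecord(byte[] reuse) throws IOException {
        // read the complete file into memory (assumes each file fits into
        // a byte[]); this is where you would parse your protobuf message
        byte[] bytes = new byte[(int) splitLength];
        int offset = 0;
        while (offset < bytes.length) {
            int read = stream.read(bytes, offset, bytes.length - offset);
            if (read < 0) {
                throw new EOFException("Unexpected end of input split");
            }
            offset += read;
        }
        end = true;
        return bytes;
    }
}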

If your InputFormat does not extend FileInputFormat, you can use
createInput() instead of readFile().
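
And to show how it could be wired in (again just a sketch, with
placeholder paths and assuming the DataSet API's ExecutionEnvironment):

import java.util.Arrays;
import java.util.List;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

List<String> fileList = Arrays.asList(
        "s3://my-bucket/data/file-0001.pb",   // placeholder paths
        "s3://my-bucket/data/file-0042.pb");

// Each file becomes one input split, so the files are distributed across
// the parallel source tasks, but each file is read by a single task.
DataSet<byte[]> rawFiles = env.createInput(new FileListInputFormat(fileList));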

Best, Fabian

2017-08-31 21:24 GMT+02:00 ShB <sh...@gmail.com>:

> Hi Fabian,
>
> Thanks for your response.
>
> If I implemented my own InputFormat, how would I read a specific list of
> files from S3?
>
> Assuming I need to use readFile(), the following would read all of the
> files from the specified S3 bucket or path:
> env.readFile(MyInputFormat, "s3://my-bucket/")
>
> Is there a way for me to read only a specific list/subset of files (say
> fileList) from an S3 bucket, in parallel, using readFile()?
>
>
>