Posted to user@flink.apache.org by Meghajit Mazumdar <me...@gojek.com> on 2022/01/27 08:56:20 UTC

Does FileSource download all remote files for generating splits

Hello,

I have a question about the FileSource in Flink 1.14
<https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/connector/file/src/FileSource.html>.

Suppose a FileSource is set to read from a remote GCS URL. From what I have
read, the FileEnumerator is responsible for discovering the files under that
URL.

However, how does the FileSource, and thus the FileEnumerator, generate the
splits when a remote URL is used? Does it:
1. download all the files eagerly and then generate the splits, or
2. download files and generate splits only when the source reader asks for
splits, or
3. not download anything, but only stream the data from the remote location
as required?

It would be great if somebody could help me out. Thanks!

*Regards,*
*Meghajit*

Re: Does FileSource download all remote files for generating splits

Posted by Meghajit Mazumdar <me...@gojek.com>.
Thanks, Caizhi. This clarifies it.

On Fri, Jan 28, 2022 at 12:06 PM Caizhi Weng <ts...@gmail.com> wrote:

> Hi!
>
> FileEnumerator never reads the actual content of a file. It runs on the
> JobManager and reads only the necessary metadata of each file (for example,
> the file size) so that it can split the work across all TaskManagers. The
> corresponding file readers, on the other hand, run on the TaskManagers and
> do the actual reading: they accept the file splits assigned to them and
> read the contents corresponding to those splits.

-- 
*Regards,*
*Meghajit*

Re: Does FileSource download all remote files for generating splits

Posted by Caizhi Weng <ts...@gmail.com>.
Hi!

FileEnumerator never reads the actual content of a file. It runs on the
JobManager and reads only the necessary metadata of each file (for example,
the file size) so that it can split the work across all TaskManagers. The
corresponding file readers, on the other hand, run on the TaskManagers and
do the actual reading: they accept the file splits assigned to them and
read the contents corresponding to those splits.
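
To illustrate the idea of planning splits from metadata alone, here is a
minimal conceptual sketch in plain Java. It is not Flink's implementation: the
class SplitPlanningSketch, the method planSplits, the maxSplitSize parameter
and the gs:// path are all hypothetical names chosen for illustration. The
only inputs are a path and a file size, which is the kind of metadata an
enumerator can obtain without downloading the file.

```java
import java.util.ArrayList;
import java.util.List;

/** Conceptual sketch only; NOT Flink's code. All names here are hypothetical. */
final class SplitPlanningSketch {

    /** A byte range within one file. It carries no file content. */
    static final class Split {
        final String path;
        final long offset;
        final long length;

        Split(String path, long offset, long length) {
            this.path = path;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return path + " [offset=" + offset + ", length=" + length + "]";
        }
    }

    /** Plans byte ranges from the file size alone; the file is never opened. */
    static List<Split> planSplits(String path, long fileSize, long maxSplitSize) {
        if (maxSplitSize <= 0) {
            throw new IllegalArgumentException("maxSplitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += maxSplitSize) {
            // The last range may be shorter than maxSplitSize.
            long length = Math.min(maxSplitSize, fileSize - offset);
            splits.add(new Split(path, offset, length));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Hypothetical 250-byte remote object, planned in ranges of at most 100 bytes.
        for (Split s : planSplits("gs://bucket/data/part-0.txt", 250L, 100L)) {
            System.out.println(s);
        }
    }
}
```

In this sketch a reader that receives one of these ranges would open the
remote file itself and read only the bytes from offset to offset + length,
which corresponds to option 3 in the original question.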
