Posted to user@flink.apache.org by Kirti Dhar Upadhyay K via user <us...@flink.apache.org> on 2023/04/20 12:29:36 UTC

SplitEnumerator and SourceReader

Hi Community,

I recently started using the file source in Flink 1.17.x.
I was going through the FLIP-27 documentation, and as far as I understand, the SplitEnumerator lists files (splits) and assigns them to SourceReaders. A single SplitEnumerator instance runs, whereas the SourceReaders can run in parallel. I have the following questions:


  1.  Who actually downloads the file (let's say the file is on S3)? Does the SplitEnumerator download the files and then assign the splits to the SourceReaders, or does it only list the files and pass each file path in a split to a SourceReader, which then downloads and processes the file?

  2.  Is the complete file downloaded in one go, or is chunked downloading also possible?

  3.  I understood that the SplitEnumerator can run either on the JobManager or on a single TaskManager instance. How can a user configure where it runs?

  4.  Is there any memory footprint impact if FileSource runs in streaming mode (continuous streaming)?


Thanks for any help!

Regards,
Kirti Dhar

RE: SplitEnumerator and SourceReader

Posted by Kirti Dhar Upadhyay K via user <us...@flink.apache.org>.
Thanks a lot, Martijn, for the quick response.

For point 3, I may have been confused by the link below:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-where_run_enumerator

Anyway, thanks for clarifying everything.

Just one further question on:
“Yes, because the enumerator needs to remember the paths of all currently processed files. Depending on the use case, that can grow to be big. This is documented at https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/#current-limitations”

Is there any recommendation for dealing with this limitation, for example regarding the size of files, the number of files, or the checkpointing state backend?

Regards,
Kirti Dhar
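
To make the quoted limitation concrete, here is a minimal sketch of why the enumerator state grows in continuous streaming mode (which Flink enables through `monitorContinuously` on the FileSource builder). This is a toy model in plain Python, not Flink code: the class `ToyContinuousEnumerator`, its methods, and the `s3://bucket/in/...` paths are all hypothetical and exist only for illustration.

```python
from typing import List, Set


class ToyContinuousEnumerator:
    """Toy stand-in for an enumerator that monitors a directory continuously."""

    def __init__(self) -> None:
        # Every path ever handed out is remembered so that a later
        # directory scan does not assign the same file a second time.
        self.paths_already_processed: Set[str] = set()

    def scan(self, listing: List[str]) -> List[str]:
        """Return only the paths not seen in earlier scans and remember them."""
        new_paths = [p for p in listing if p not in self.paths_already_processed]
        self.paths_already_processed.update(new_paths)
        return new_paths

    def snapshot_state(self) -> List[str]:
        """What a checkpoint would have to store in this toy model."""
        return sorted(self.paths_already_processed)


def simulate(scans: int, files_per_scan: int) -> List[int]:
    """Add files before every scan and record the state size after each one."""
    enumerator = ToyContinuousEnumerator()
    listing: List[str] = []
    sizes = []
    for scan in range(scans):
        # Hypothetical paths; old files stay in the listing, new ones arrive.
        listing.extend(
            f"s3://bucket/in/scan{scan}/file{n}" for n in range(files_per_scan)
        )
        enumerator.scan(listing)
        sizes.append(len(enumerator.snapshot_state()))
    return sizes
```

In this model the remembered state grows with the number of distinct paths seen and never shrinks, and it is independent of how large each file is. Whether the same holds exactly for a given Flink version should be checked against the linked documentation.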

From: Martijn Visser <ma...@apache.org>
Sent: 20 April 2023 18:14
To: Kirti Dhar Upadhyay K <ki...@ericsson.com>
Cc: user@flink.apache.org
Subject: Re: SplitEnumerator and SourceReader

Hi Kirti Dhar,

1. The SourceReader downloads the file that the SplitEnumerator assigns to it.
2. This depends on the format; a BulkFormat such as Parquet or ORC can be read in batches of records.
3. The SplitEnumerator runs on the JobManager, not on a TaskManager. Have you read something different in the documentation?
4. Yes, because the enumerator needs to remember the paths of all currently processed files. Depending on the use case, that can grow to be big. This is documented at https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/#current-limitations

Best regards,

Martijn




Re: SplitEnumerator and SourceReader

Posted by Martijn Visser <ma...@apache.org>.
Hi Kirti Dhar,

1. The SourceReader downloads the file that the SplitEnumerator assigns to
it.
2. This depends on the format; a BulkFormat such as Parquet or ORC can be
read in batches of records.
3. The SplitEnumerator runs on the JobManager, not on a TaskManager. Have
you read something different in the documentation?
4. Yes, because the enumerator needs to remember the paths of all currently
processed files. Depending on the use case, that can grow to be big. This
is documented at
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/#current-limitations

Best regards,

Martijn
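
The division of labour described in points 1 to 3 can be sketched with a small toy model. This is plain Python, not the Flink API: `FileSplit`, `ToyEnumerator`, `ToyReader`, and `run_demo` are hypothetical names, local temporary files stand in for S3 objects, and the round-robin assignment is a placeholder rather than Flink's actual assignment policy. The point it illustrates is that the single enumerator only lists files and hands out metadata, while each reader opens the file named in its split and consumes it in batches.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class FileSplit:
    """What the enumerator hands out: a path and a byte range, never file contents."""
    path: str
    offset: int
    length: int


class ToyEnumerator:
    """Single instance; lists files and creates splits (metadata only)."""

    def __init__(self, directory: str) -> None:
        self.directory = directory

    def discover_splits(self) -> List[FileSplit]:
        splits = []
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            splits.append(FileSplit(path, 0, os.path.getsize(path)))
        return splits


class ToyReader:
    """One instance per parallel subtask; opens the file named in the split."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size

    def read(self, split: FileSplit) -> Iterator[List[str]]:
        batch: List[str] = []
        with open(split.path, "r", encoding="utf-8") as handle:
            for line in handle:  # streamed line by line, not loaded in one go
                batch.append(line.rstrip("\n"))
                if len(batch) == self.batch_size:
                    yield batch
                    batch = []
        if batch:  # emit the final, possibly smaller batch
            yield batch


def run_demo() -> List[List[List[str]]]:
    """Write three small files, then let two readers consume the splits."""
    with tempfile.TemporaryDirectory() as directory:
        for index in range(3):
            file_path = os.path.join(directory, f"part-{index}.txt")
            with open(file_path, "w", encoding="utf-8") as out:
                out.write("\n".join(f"file{index}-record{n}" for n in range(5)))
        enumerator = ToyEnumerator(directory)
        readers = [ToyReader(batch_size=2), ToyReader(batch_size=2)]
        per_reader: List[List[List[str]]] = [[] for _ in readers]
        # Placeholder round-robin assignment: only splits reach the readers,
        # the file data never passes through the enumerator.
        for position, split in enumerate(enumerator.discover_splits()):
            target = position % len(readers)
            per_reader[target].extend(readers[target].read(split))
        return per_reader
```

Calling `run_demo()` returns, for each of the two toy readers, the list of record batches it read from the files assigned to it.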


