You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Chirag Dewan <ch...@yahoo.in> on 2018/05/28 04:35:35 UTC

Large number of sources in Flink Job

Hi,
I am working on a use case where my Flink job needs to collect data from thousands of sources. 
As an example, I want to collect data from more than 2000 File Directories, process(filter, transform) the data and distribute the processed data streams to 200 different directories.
Are there any caveats I should know with such large number of sources, also taking into account per operator parallelism? 
Regards,
Chirag

Re: Large number of sources in Flink Job

Posted by Fabian Hueske <fh...@gmail.com>.

Hi Chirag,

There have been some issue with very large execution graphs.
You might need to adjust the default configuration and configure larger
Akka buffers and/or timeouts.

Also, 2000 sources means that you run at least 2000 threads at once.

The FileInputFormat (and most of its sub-classes) in Flink 1.5.0 can be
configured to accept multiple directories.
This would be a preferred approach to creating one source per directory.

Best, Fabian

2018-05-28 6:35 GMT+02:00 Chirag Dewan <ch...@yahoo.in>:

> Hi,
>
> I am working on a use case where my Flink job needs to collect data from
> thousands of sources.
>
> As an example, I want to collect data from more than 2000 File
> Directories, process(filter, transform) the data and distribute the
> processed data streams to 200 different directories.
>
> Are there any caveats I should know with such large number of sources,
> also taking into account per operator parallelism?
>
> Regards,
>
> Chirag
>
>