Posted to user@flink.apache.org by Chirag Dewan <ch...@yahoo.in> on 2018/05/28 04:35:35 UTC
Large number of sources in Flink Job
Hi,
I am working on a use case where my Flink job needs to collect data from thousands of sources.
As an example, I want to collect data from more than 2000 file directories, process (filter, transform) the data, and distribute the processed data streams to 200 different directories.
Are there any caveats I should know about with such a large number of sources, also taking per-operator parallelism into account?
Regards,
Chirag
Re: Large number of sources in Flink Job
Posted by Fabian Hueske <fh...@gmail.com>.
Hi Chirag,
There have been some issues with very large execution graphs.
You might need to adjust the default configuration and configure larger
Akka buffers and/or timeouts.
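
As a rough illustration of the kind of settings meant here, a flink-conf.yaml fragment might look like the following. The keys are standard Flink configuration options; the values are purely assumed placeholders and would need to be tuned for the actual job and cluster.

```yaml
# Illustrative values only (assumptions, not recommendations).
# Maximum size of messages exchanged between JobManager and TaskManagers.
akka.framesize: 52428800b
# Timeout for blocking Akka calls; large graphs can take longer to deploy.
akka.ask.timeout: 60 s
# Timeout for blocking calls made on the client side.
akka.client.timeout: 120 s
```
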
Also, 2000 sources means that you run at least 2000 threads at once.
The FileInputFormat (and most of its sub-classes) in Flink 1.5.0 can be
configured to accept multiple directories.
This would be preferable to creating one source per directory.
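
A minimal sketch of how the directory list for such a single multi-path source could be assembled, using only the JDK. The class name DirectoryPaths and the assumption that all input directories sit under one root are hypothetical; the Flink calls appear only in a comment and assume the setFilePaths(String...) method of FileInputFormat in 1.5.0.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper: gathers many input directories into one array
// so that a single input format can read them all.
public class DirectoryPaths {

    /** Returns the sorted absolute paths of the direct sub-directories of root. */
    public static String[] collect(Path root) throws IOException {
        try (Stream<Path> children = Files.list(root)) {
            return children
                    .filter(Files::isDirectory)               // skip plain files
                    .map(p -> p.toAbsolutePath().toString())
                    .sorted()
                    .toArray(String[]::new);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        String[] dirs = collect(root);
        System.out.println(dirs.length + " directories collected");

        // With Flink 1.5.0 on the classpath, the array could feed ONE source
        // instead of one source per directory (assumed usage, not run here):
        //   TextInputFormat format = new TextInputFormat(null);
        //   format.setFilePaths(dirs);
        //   env.createInput(format);
    }
}
```

The point of the design is that the job graph then contains one source operator whose parallelism can be chosen independently of the number of directories.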
Best, Fabian