Posted to user@flink.apache.org by Jakes John <ja...@gmail.com> on 2017/08/08 00:58:23 UTC

Flink streaming Parallelism

       I am coming from the Apache Storm world and am planning to switch from
Storm to Flink. While reading the Flink documentation, I couldn't find Flink
equivalents for some of the capabilities I rely on in Storm.

I need a streaming pipeline Kafka -> Flink -> Elasticsearch. In Storm, I can
specify the number of tasks per bolt. Databases are typically slow on writes,
so I need more writers to the database. Reading from Kafka is fast compared
to ES writes, which means I need more ES writer tasks than Kafka consumers.
How can I achieve this in Flink? What are the Flink concepts corresponding
to Storm's parallelism concepts such as workers, executors, and tasks?
        I also saw that the Elasticsearch sink in Flink can batch messages
before writing. How can I batch data based on custom logic, for example
batching writes grouped on one of the message keys? This is possible in
Storm via FieldGrouping, but I couldn't find an equivalent way in Flink to
group the data and control the overall number of writers to ES.

Please help me with the above questions; any pointers on Flink parallelism would be appreciated.

Re: Flink streaming Parallelism

Posted by "Tzu-Li (Gordon) Tai" <tz...@apache.org>.
Hi,

The equivalent is to set a parallelism on your sink operator, e.g. stream.addSink(…).setParallelism(…).
By default, every operator in the pipeline runs with the parallelism that was set for the whole job, unless a parallelism is explicitly set for a specific operator. For more details on the distributed runtime concepts, take a look at [1].
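For example, here is a minimal sketch using the Flink 1.3-era Java APIs. The topic name, Kafka properties, and parallelism values are placeholders, and print() stands in for a configured ElasticsearchSink; adjust the connector classes to your versions:

    import java.util.Properties;

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class KafkaToEsParallelism {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(4); // job-wide default for operators without their own setting

            Properties kafkaProps = new Properties();
            kafkaProps.setProperty("bootstrap.servers", "localhost:9092");
            kafkaProps.setProperty("group.id", "flink-consumer");

            // Fewer source tasks: reading from Kafka is cheap relative to ES writes.
            DataStream<String> messages = env
                .addSource(new FlinkKafkaConsumer010<>("events", new SimpleStringSchema(), kafkaProps))
                .setParallelism(2);

            // More sink tasks: scale the slow writes out independently of the source.
            messages.print().setParallelism(8); // print() stands in for the ES sink

            env.execute("kafka-to-es-parallelism");
        }
    }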

        I also saw that the Elasticsearch sink in Flink can batch messages before writing. How can I batch data based on custom logic, for example batching writes grouped on one of the message keys? This is possible in Storm via FieldGrouping.
The equivalent of Storm's field-based partitioning in Flink is `stream.keyBy(…)`. All messages with the same key then go to the same parallel downstream operator instance. If that downstream operator is an ElasticsearchSink, then after the keyBy all messages with the same key will be batched by the same Elasticsearch writer.
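A short sketch of that pattern, continuing the job above (extractKey and esSink are hypothetical placeholders for your own key logic and a configured ElasticsearchSink):

    // Hash-partition on a message key: the Flink analogue of Storm's FieldGrouping.
    // Each parallel sink instance then receives every record for its keys and can
    // batch those records together.
    messages
        .keyBy(record -> extractKey(record)) // extractKey: your own key logic
        .addSink(esSink)                     // esSink: a configured ElasticsearchSink
        .setParallelism(8);                  // bounds the number of concurrent ES writers

The sink's parallelism, together with the connector's bulk-flush settings, is what controls the overall write load on ES.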

Hope this helps!

Cheers,
Gordon

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/concepts/runtime.html

