Posted to user@beam.apache.org by Harshvardhan Agrawal <ha...@gmail.com> on 2018/05/15 18:20:48 UTC

Controlling parallelism of a ParDo Transform while writing to DB

Hi Guys,

I am currently developing a pipeline using Apache Beam with Flink as the
execution engine. As part of the process I read data from Kafka and perform
a number of transformations that involve joins, aggregations, and lookups
against an external DB.

The idea is that we want higher parallelism in Flink while performing the
aggregations, but we eventually want to coalesce the data and have a smaller
number of processes writing to the DB so that the target DB can handle the
load (for example, a parallelism of 40 for the aggregations but only 10 when
writing to the target DB).
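
Roughly, the pipeline looks like this (a sketch in the Beam Java SDK;
ParseFn, LookupFn, and WriteFn are placeholders for the real parsing,
DB-lookup, and DB-write logic, and the broker/topic names are made up):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.kafka.common.serialization.StringDeserializer;

    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadFromKafka", KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")        // made-up broker
            .withTopic("trades")                       // made-up topic
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
     .apply("ParseAndKey", ParDo.of(new ParseFn()))    // placeholder DoFn
     .apply("Aggregate", Sum.longsPerKey())            // want parallelism 40
     .apply("LookupFromDb", ParDo.of(new LookupFn()))  // placeholder DoFn
     .apply("WriteToDb", ParDo.of(new WriteFn()));     // want parallelism 10

    p.run().waitUntilFinish();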

Is there any way we could do that in Beam?

Regards,

Harsh
-- 

Regards,
Harshvardhan Agrawal
267.991.6618 | LinkedIn <https://www.linkedin.com/in/harshvardhanagr/>

Re: Controlling parallelism of a ParDo Transform while writing to DB

Posted by Jins George <ji...@aeris.net>.
I am a user running Beam + Flink. The Flink runner currently exposes only
job-level parallelism, not operator-level parallelism. This would be a
really nice feature if it could be supported.
Flink's DataStream API does provide that option, though.
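
For example, in the DataStream API something like this is possible
(kafkaSource, dbSink, and the Event type are placeholders, not real
connector names):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(40);                  // job-wide default parallelism

    DataStream<Event> events = env.addSource(kafkaSource);  // runs at 40
    events.keyBy(e -> e.key)
          .sum("amount")                     // aggregation also runs at 40
          .addSink(dbSink)
          .setParallelism(10);               // only the DB sink runs at 10

    env.execute("per-operator parallelism");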

Thanks,
Jins George

Re: Controlling parallelism of a ParDo Transform while writing to DB

Posted by Chamikara Jayalath <ch...@google.com>.
The exact mechanism for controlling parallelism is runner specific. It looks
like Flink allows users to specify the amount of parallelism (per job) using
the following option:
https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineOptions.java#L65

I'm not sure if Flink allows finer-grained control.
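
E.g., something like the following sets that option programmatically
(equivalently, pass --runner=FlinkRunner --parallelism=40 on the command
line):

    import org.apache.beam.runners.flink.FlinkPipelineOptions;
    import org.apache.beam.runners.flink.FlinkRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    FlinkPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);
    options.setParallelism(40);  // applies uniformly to every operator
    Pipeline p = Pipeline.create(options);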

Re: Controlling parallelism of a ParDo Transform while writing to DB

Posted by Harshvardhan Agrawal <ha...@gmail.com>.
How do we control the parallelism of a particular step then? Is there a
recommended approach to this problem?

Re: Controlling parallelism of a ParDo Transform while writing to DB

Posted by Chamikara Jayalath <ch...@google.com>.
I don't think this can be specified through the Beam API, but the Flink
runner might have additional configurations that I'm not aware of. Also,
many runners fuse consecutive steps to improve execution performance, so
simply specifying the parallelism of a single step will not work.

Thanks,
Cham

Re: Controlling parallelism of a ParDo Transform while writing to DB

Posted by Raghu Angadi <ra...@google.com>.
If you want tight control over parallelism, you can 'reshuffle' the elements
into a fixed number of shards. See
https://stackoverflow.com/questions/46116443/dataflow-streaming-job-not-scaleing-past-1-worker
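
A rough sketch of the pattern (String stands in for the real element type,
and the actual DB write is elided): each element gets a random shard key in
[0, 10), and a GroupByKey caps the number of distinct keys at 10, so roughly
10 workers write concurrently. For an unbounded source the GroupByKey also
needs a windowing strategy upstream.

    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    static void writeWithLimitedParallelism(PCollection<String> aggregated) {
      aggregated
          .apply("AssignRandomShard", ParDo.of(
              new DoFn<String, KV<Integer, String>>() {
                @ProcessElement
                public void process(ProcessContext c) {
                  // 10 distinct keys => at most ~10 concurrent writers
                  c.output(KV.of(ThreadLocalRandom.current().nextInt(10),
                                 c.element()));
                }
              }))
          .apply("LimitShards", GroupByKey.<Integer, String>create())
          .apply("WriteShard", ParDo.of(
              new DoFn<KV<Integer, Iterable<String>>, Void>() {
                @ProcessElement
                public void process(ProcessContext c) {
                  for (String row : c.element().getValue()) {
                    // hypothetical: issue the actual DB write for 'row'
                  }
                }
              }));
    }

Note that the GroupByKey also acts as a fusion break, which addresses the
step-fusion issue mentioned earlier in the thread.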
