Posted to user@flink.apache.org by Gorjan Todorovski <go...@gmail.com> on 2021/08/13 10:48:33 UTC

Scaling Flink for batch jobs

Hi!

I want to deploy a Flink cluster as a native Kubernetes session cluster,
with the intention of executing Apache Beam jobs that process only batch
data, but I am not sure I understand how I would scale the cluster when I
need to process large datasets.
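
For reference, a minimal sketch of the kind of submission I mean (the
Flink master address, environment type, and parallelism below are just
placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder options for submitting a Beam batch job to an existing
# Flink session cluster; flink_master points at the JobManager REST port.
options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=flink-jobmanager:8081",  # placeholder service address
    "--environment_type=LOOPBACK",           # simplest SDK harness, for testing
    "--parallelism=4",
])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "Create" >> beam.Create(["a", "b", "c"])
     | "Print" >> beam.Map(print))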

My understanding is that to process a bigger dataset, you run the job
with higher parallelism, so the processing is spread across multiple task
slots, which may be on multiple nodes (for example, with two slots per
TaskManager, a parallelism of 4 would need two TaskManagers).
But my Beam jobs actually execute TensorFlow Extended (TFX) pipelines, so
I have no control over partitioning by key, and I see no difference in
throughput (the time it takes to process a specific dataset) between a
parallelism of 2 and a parallelism of 4 - it takes the same time.

Also, since the execution is of type "PIPELINED": if I don't increase
parallelism and just run the job on a fixed number of task slots, will a
job on a sufficiently large dataset fail (due to lack of memory on the
task managers), or will it just take longer to process the data?

Thanks,
Gorjan

Re: Scaling Flink for batch jobs

Posted by Gorjan Todorovski <go...@gmail.com>.
Thanks, I'll check more about job tuning.

Re: Scaling Flink for batch jobs

Posted by Caizhi Weng <ts...@gmail.com>.
Hi!

> I see no difference in throughput (the time it takes to process a
> specific dataset) between a parallelism of 2 and a parallelism of 4 - it
> takes the same time.
It might be that some parallel instances receive no data. You can click on
the nodes in the Flink web UI and check each parallel subtask to see if
that is the case, or you can look at the metrics of each operator.
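
If you prefer to script this check, a rough sketch against the Flink REST
API (the JobManager address is a placeholder; the per-subtask read-records
figures show whether the data is evenly distributed):

import requests

FLINK = "http://flink-jobmanager:8081"  # placeholder JobManager REST address

# Look at the first job; a real script would select the job by name or id.
job_id = requests.get(f"{FLINK}/jobs").json()["jobs"][0]["id"]
job = requests.get(f"{FLINK}/jobs/{job_id}").json()

# For every vertex, print how many records each parallel subtask has read.
# A subtask stuck at 0 means that parallel instance received no data.
for vertex in job["vertices"]:
    url = f"{FLINK}/jobs/{job_id}/vertices/{vertex['id']}"
    for sub in requests.get(url).json()["subtasks"]:
        print(vertex["name"], sub["subtask"], sub["metrics"]["read-records"])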

> if I don't increase parallelism and just run the job on a fixed number of
> task slots, will the job fail (due to lack of memory on the task
> managers), or will it just take longer to process the data?
It depends on many aspects, such as the type of source you are using, the
type of operators you are running, and so on. Ideally we hope the job will
just take longer, but for some specific operators or connectors it might
fail. This is where users have to tune their jobs.
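
As an example of the kind of knobs involved (the values below are only
illustrative; for a native Kubernetes session cluster they are set when
the session is started):

taskmanager.memory.process.size: 4096m    # total memory per TaskManager
taskmanager.memory.managed.fraction: 0.4  # managed memory, used by batch operators for sorting and spilling
taskmanager.numberOfTaskSlots: 2          # parallel subtasks per TaskManager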
