You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@beam.apache.org by André Missaglia <an...@arquivei.com.br> on 2020/06/26 17:26:51 UTC

Processing data of different sizes

Hi everyone

I'm building a pipeline where I group the elements and then execute a
CPU-intensive function on each group. This function performs a statistical
analysis over the elements, only to return a single value on the end.

But because each group has a different amount of elements, some groups are
processed really quickly, others may take up to 30min to run. The problem
is that the pipeline processes 99% of the groups in a couple of minutes,
but then spends another 2 hours processing the big groups. The image below
illustrates what I mean:
[image: image.png]



Even worse than that, if I use for example, 20 dataflow instances with 32
cores, and the big groups end up each on different machines, I'm gonna pay
for all those instances while the job isn't done.

I know that one optimization would be to split the groups into
equally-sized groups, but I'm not sure that is possible in this case given
the calculation I'm performing.

So I was thinking, is there any way I can "tell" the runner how long I
think the DoFn is going to run, so that it can do a better job scheduling
those elements?

Thanks!

-- 
*André Badawi Missaglia*
Data Engineer
(16) 3509-5515 *|* www.arquivei.com.br
<https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
[image: Arquivei.com.br – Inteligência em Notas Fiscais]
<https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
[image: Google seleciona Arquivei para imersão e mentoria no Vale do
Silício]
<https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad>
<https://www.facebook.com/arquivei>
<https://www.linkedin.com/company/arquivei>
<https://www.youtube.com/channel/UC6mrvVc7b5mA2SyYPEVg8cw>

Re: Processing data of different sizes

Posted by Luke Cwik <lc...@google.com>.

There is currently no way to tell the runner how long something is expected
to take.

Splitting the groups into even sized groups with an aggregation that
happens further in the pipeline of those partial groups will work best.
Alternatively, you could run a different pipeline for "different" sized
groups (one for the small groups and one for the large groups).

FInally, there is a feature called splittable DoFns which allows you to
"pause" processing. This does allow runners to relocate work to other
machines allowing for downscaling of poorly utilized machines. This is not
yet available to Beam Java users using Dataflow though.

On Fri, Jun 26, 2020 at 10:27 AM André Missaglia <
andre.missaglia@arquivei.com.br> wrote:

> Hi everyone
>
> I'm building a pipeline where I group the elements and then execute a
> CPU-intensive function on each group. This function performs a statistical
> analysis over the elements, only to return a single value on the end.
>
> But because each group has a different amount of elements, some groups are
> processed really quickly, others may take up to 30min to run. The problem
> is that the pipeline processes 99% of the groups in a couple of minutes,
> but then spends another 2 hours processing the big groups. The image below
> illustrates what I mean:
> [image: image.png]
>
>
>
> Even worse than that, if I use for example, 20 dataflow instances with 32
> cores, and the big groups end up each on different machines, I'm gonna pay
> for all those instances while the job isn't done.
>
> I know that one optimization would be to split the groups into
> equally-sized groups, but I'm not sure that is possible in this case given
> the calculation I'm performing.
>
> So I was thinking, is there any way I can "tell" the runner how long I
> think the DoFn is going to run, so that it can do a better job scheduling
> those elements?
>
> Thanks!
>
> --
> *André Badawi Missaglia*
> Data Engineer
> (16) 3509-5515 *|* www.arquivei.com.br
> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
> [image: Arquivei.com.br – Inteligência em Notas Fiscais]
> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
> [image: Google seleciona Arquivei para imersão e mentoria no Vale do
> Silício]
> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad>
> <https://www.facebook.com/arquivei>
> <https://www.linkedin.com/company/arquivei>
> <https://www.youtube.com/channel/UC6mrvVc7b5mA2SyYPEVg8cw>
>

Re: Processing data of different sizes

Posted by Reuven Lax <re...@google.com>.

How are you grouping the elements today?

On Fri, Jun 26, 2020 at 10:27 AM André Missaglia <
andre.missaglia@arquivei.com.br> wrote:

> Hi everyone
>
> I'm building a pipeline where I group the elements and then execute a
> CPU-intensive function on each group. This function performs a statistical
> analysis over the elements, only to return a single value on the end.
>
> But because each group has a different amount of elements, some groups are
> processed really quickly, others may take up to 30min to run. The problem
> is that the pipeline processes 99% of the groups in a couple of minutes,
> but then spends another 2 hours processing the big groups. The image below
> illustrates what I mean:
> [image: image.png]
>
>
>
> Even worse than that, if I use for example, 20 dataflow instances with 32
> cores, and the big groups end up each on different machines, I'm gonna pay
> for all those instances while the job isn't done.
>
> I know that one optimization would be to split the groups into
> equally-sized groups, but I'm not sure that is possible in this case given
> the calculation I'm performing.
>
> So I was thinking, is there any way I can "tell" the runner how long I
> think the DoFn is going to run, so that it can do a better job scheduling
> those elements?
>
> Thanks!
>
> --
> *André Badawi Missaglia*
> Data Engineer
> (16) 3509-5515 *|* www.arquivei.com.br
> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
> [image: Arquivei.com.br – Inteligência em Notas Fiscais]
> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura>
> [image: Google seleciona Arquivei para imersão e mentoria no Vale do
> Silício]
> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad>
> <https://www.facebook.com/arquivei>
> <https://www.linkedin.com/company/arquivei>
> <https://www.youtube.com/channel/UC6mrvVc7b5mA2SyYPEVg8cw>
>