Posted to user@spark.apache.org by Pedro Tuero <tu...@gmail.com> on 2022/04/21 18:43:29 UTC

Coalesce, parallelism, time, idle cores, spills...

Hi guys,
I was checking the run times on a cluster and noticed that one stage was
using only 100 cores because there were only 100 tasks.
It was running on AWS EMR with this app conf:
[image: screenshot of the application configuration]

Here are some details about the stage:
[image: screenshot of the stage details]

The cluster had more than 400 cores, but it only used 100 in this stage.
Looking at the code, I noticed there was a coalesce(100) after a
flatMapToPair.
This is the dag of the stage:

[image: screenshot of the stage DAG]


The coalesce has only one purpose: to reduce the number of output files.
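For context, here is a rough, hypothetical sketch of the shape of such a job
(input paths, key/value types, and the flatMapToPair body are made up; only the
coalesce(100) after the flatMapToPair mirrors what is described above). Because
this coalesce does not shuffle, it is folded into the same stage as the
flatMapToPair, which is why the stage runs with only 100 tasks:

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CoalesceShapeSketch {
  static void run(JavaSparkContext sc) {
    // Assume the input naturally splits into ~2000 partitions.
    JavaRDD<String> input = sc.textFile("s3://some-bucket/input/");

    // Placeholder transformation: emit one (token, 1) pair per token.
    JavaPairRDD<String, Integer> pairs = input.flatMapToPair(line ->
        Arrays.stream(line.split("\\s+"))
              .map(tok -> new Tuple2<String, Integer>(tok, 1))
              .iterator());

    // Narrow (no-shuffle) coalesce: it does not create a new stage, so the
    // flatMapToPair above is also executed by only 100 tasks.
    pairs.coalesce(100).saveAsTextFile("s3://some-bucket/output/");
  }
}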
So I ran an experiment and removed the coalesce, hoping that there would be
more than 100 tasks and all the cores could be used.
The new dag:
[image: screenshot of the new stage DAG]

Then I ran the step with the same machines and input data (some minor
changes in the data could have happened).
But the time spent by the stage was the same, and now there is spill to memory
and disk:
[image: screenshot of the stage details showing memory and disk spill]




The final output is the same in terms of size, but the new one has ~2000
parts instead of 100.
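
(Just for reference, and reusing the hypothetical `pairs` RDD from the sketch
above: the partition counts can be confirmed directly; the ~2000 figure is an
assumption based on the numbers in this thread.)

// Hypothetical check, reusing "pairs" from the earlier sketch:
System.out.println(pairs.getNumPartitions());               // ~2000 without coalesce (assumed)
System.out.println(pairs.coalesce(100).getNumPartitions()); // 100 with coalesce(100)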

The first question is: why on earth can a job with more parallelism have
more spill than one with less (under otherwise equal conditions, as in this
case)?

Thanks,
Pedro.

Re: Coalesce, parallelism, time, idle cores, spills...

Posted by Pedro Tuero <tu...@gmail.com>.
Another related question is how to avoid a coalesce changing the number of
tasks of a "previous" mapToPair, as described here:
https://stackoverflow.com/a/47504455
But I don't want/need to do a shuffle... The only objective of the
coalesce is to reduce the number of output files.
Is there a better strategy?
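
For context, a minimal sketch of the two options being weighed here (the
`pairs` variable and the paths are placeholders for whatever the real job
produces; `coalesce(n, true)` is what `repartition(n)` does under the hood).
Illustrative only; in practice you would pick one of the two writes:

import org.apache.spark.api.java.JavaPairRDD;

public class OutputFileStrategySketch {
  // "pairs" stands for the RDD produced by the expensive flatMapToPair/mapToPair.
  static void write(JavaPairRDD<String, Integer> pairs) {
    // Option 1: narrow coalesce. No shuffle, but the upstream stage is
    // collapsed to 100 tasks, leaving most cores idle.
    pairs.coalesce(100).saveAsTextFile("s3://some-bucket/out-narrow/");

    // Option 2: coalesce with shuffle = true (equivalent to repartition(100)).
    // The upstream transformation keeps its full parallelism and a separate
    // shuffle stage writes the 100 output files, at the cost of moving data.
    pairs.coalesce(100, true).saveAsTextFile("s3://some-bucket/out-shuffled/");
  }
}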

Thanks

