Posted to user@spark.apache.org by David Diebold <da...@gmail.com> on 2021/12/14 18:00:37 UTC

question about data skew and memory issues

Hello all,

I was wondering whether it is possible to encounter out-of-memory exceptions on
Spark executors when doing an aggregation on a skewed dataset.
Let's say we have a dataset with two columns:
- key : int
- value : float
And I want to aggregate values by key.
Let's say that we have tons of rows with key equal to 0.

Is it possible to encounter an out-of-memory exception after the shuffle?
My expectation would be that the executor responsible for aggregating the
'0' partition could indeed hit an OOM exception if it tries to put all
the files of this partition in memory before processing them.
But why would it need to put them in memory when doing an aggregation? It
looks to me like aggregation can be performed in a streaming fashion, so I
would not expect any OOM at all.
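
To make the two cases in the question concrete, here is a minimal plain-Python
sketch. It is not Spark code and does not claim to reproduce Spark internals;
the function names (skewed_rows, streaming_sum, buffered_sum) are made up for
illustration. It only contrasts an aggregation that keeps one running total
per key with one that collects every value for a key before reducing.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[int, float]


def skewed_rows(n_hot: int, n_other: int) -> Iterator[Row]:
    """Lazily yield n_hot rows with key 0, then one row for each other key."""
    for _ in range(n_hot):
        yield (0, 1.0)
    for key in range(1, n_other + 1):
        yield (key, 2.0)


def streaming_sum(rows: Iterable[Row]) -> Dict[int, float]:
    """Keep one running total per key.

    Memory grows with the number of distinct keys, not with the number of
    rows, so a heavily repeated key costs no extra space.
    """
    totals: Dict[int, float] = defaultdict(float)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def buffered_sum(rows: Iterable[Row]) -> Dict[int, float]:
    """Collect every value per key first, then sum.

    Memory grows with the number of rows, so the list for a hot key can
    become very large.
    """
    groups: Dict[int, List[float]] = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}


if __name__ == "__main__":
    # Both approaches agree on the result; only their memory profile differs.
    print(streaming_sum(skewed_rows(1000, 3)) == buffered_sum(skewed_rows(1000, 3)))
```

Which pattern a real job resembles depends on the aggregation used. A sum can
be combined incrementally, whereas an operation that needs all values of a key
at once (for example groupByKey on an RDD) has to hold them for that key.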

Thank you in advance for your insights :)
David

Re: question about data skew and memory issues

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,
Also, if you are using Spark 3.2.x, please see the documentation on
handling skew through Spark settings.
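
The message does not name the settings. One plausible candidate, and this is
an assumption, is the skew-join handling in Adaptive Query Execution (AQE).
The sketch below lists those properties and renders them as spark-submit
--conf arguments; the values are illustrative, and the helper to_submit_args
is a made-up name. Note that these properties target skewed joins, so whether
they help a skewed aggregation is not established in this thread.

```python
from typing import Dict, List

# AQE properties related to skew handling. Values are illustrative only;
# check the documentation for your Spark version before relying on them.
AQE_SKEW_SETTINGS: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "256MB",
}


def to_submit_args(settings: Dict[str, str]) -> List[str]:
    """Render settings as spark-submit '--conf key=value' argument pairs."""
    args: List[str] = []
    for key, value in settings.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the flags so they can be pasted after 'spark-submit'.
    print(" ".join(to_submit_args(AQE_SKEW_SETTINGS)))
```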

Regards,
Gourav Sengupta


Re: question about data skew and memory issues

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi David,

Can you share an example of the code you are running and how you are
aggregating over keys?

HTH






