Posted to user@spark.apache.org by Alex Dzhagriev <dz...@gmail.com> on 2016/02/22 15:12:36 UTC

an OOM while persist as DISK_ONLY

Hello all,

I'm using Spark 1.6 and trying to cache a dataset which is 1.5 TB. I have
only ~800 GB of RAM in total, so I am choosing the DISK_ONLY storage level.
Unfortunately, I'm running into the overhead memory limit:


Container killed by YARN for exceeding memory limits. 27.0 GB of 27 GB
physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.


I'm giving 6 GB of overhead memory and using 10 cores per executor.
Apparently, that's not enough. Without persisting the data (and therefore
computing the dataset twice in my case) the job works fine. Can anyone
please explain what overhead consumes that much memory while persisting to
disk, and how I can estimate how much extra memory to give the executors so
the job doesn't fail?
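
For context, the job boils down to something like this (the input path and
the two downstream computations are placeholders; the parts that matter are
the DISK_ONLY persist and the overhead/cores settings):

// Submitted with something like:
//   spark-submit --executor-cores 10 \
//     --conf spark.yarn.executor.memoryOverhead=6144 ...
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("disk-only-cache"))
val sqlContext = new SQLContext(sc)

val ds = sqlContext.read.parquet("hdfs:///data/input") // ~1.5 TB; path is illustrative
ds.persist(StorageLevel.DISK_ONLY)                     // keep blocks on local disk, not in RAM

// The dataset is reused twice downstream, hence the persist
val total    = ds.count()
val distinct = ds.distinct().count()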

Thanks, Alex.

Re: an OOM while persist as DISK_ONLY

Posted by Eugen Cepoi <ce...@gmail.com>.
We are in the process of upgrading from Spark 1.4 to 1.6 and had a hard
time getting some of our more memory- and join-intensive jobs (RDD caching
+ a lot of shuffling) to work. Most of the time they were getting killed by
YARN.

Increasing the overhead was of course an option, but the increase needed to
make the job pass was far higher than the overhead we used on Spark 1.4,
which is way too much to be acceptable.

Playing with the configs discussed elsewhere in this thread reduced the GC
time, but the problem still persisted.

In the end it turned out we were hitting this issue:
https://issues.apache.org/jira/browse/SPARK-12961.
What ended up working was overriding the Snappy version that comes with
EMR and disabling off-heap memory.
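
A rough sketch of that kind of override (the exact snappy-java version and
the way you ship it will depend on your setup; the version below is just a
placeholder, and spark.memory.offHeap.enabled is my reading of the
"off-heap" part):

// build.sbt: force a newer snappy-java than the one bundled with EMR's Spark
// (placeholder version -- pick the release referenced by SPARK-12961)
dependencyOverrides += "org.xerial.snappy" % "snappy-java" % "1.1.2.1"

// and on submit, keep off-heap memory disabled (Spark 1.6 config):
//   --conf spark.memory.offHeap.enabled=false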

We still need to test the upgrade against our Spark Streaming jobs...
hopefully this issue, https://issues.apache.org/jira/browse/SPARK-13288, is
also due to Snappy...

Cheers,
Eugen


Re: an OOM while persist as DISK_ONLY

Posted by Ted Yu <yu...@gmail.com>.
bq. that solved some problems

Is there any problem that was not solved by the tweak?

Thanks

Re: an OOM while persist as DISK_ONLY

Posted by Eugen Cepoi <ce...@gmail.com>.
You can limit the amount of memory Spark will use for shuffle even in 1.6.
You can do that by tweaking spark.memory.fraction and
spark.memory.storageFraction. For example, if you want to leave almost no
room for the shuffle cache, you can set spark.memory.storageFraction to 1
(or something close to it), which leaves only a small slice for shuffle and
the rest for storage. And if you don't persist or broadcast data, you can
reduce the overall spark.memory.fraction as well.
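
A minimal sketch of what that looks like (the values below are only
illustrative):

import org.apache.spark.SparkConf

// Shrink the unified pool and protect most of it for storage,
// leaving little room for execution (shuffle).
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.5")        // whole unified pool (execution + storage)
  .set("spark.memory.storageFraction", "0.9") // share of the pool reserved for storage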

That said, I'm not sure how advisable it is to tweak those values, since it
assumes Spark is mostly using that memory for caching. I had used similar
tweaks in Spark 1.4, tried them on Spark 1.6, and that solved some
problems...

Eugen

Re: an OOM while persist as DISK_ONLY

Posted by Andy Dang <na...@gmail.com>.
Spark's shuffle algorithm is very aggressive about keeping everything in
RAM, and the behavior is worse in 1.6 with the UnifiedMemoryManager. In
previous versions you could at least cap the shuffle memory, but Spark 1.6
will use as much memory as it can get. What I see is that Spark seems to
underestimate how much memory objects take up and therefore doesn't spill
frequently enough. There's a dirty workaround (legacy mode), but the common
advice is to increase your parallelism (keep in mind that operations such
as join pick a default parallelism implicitly, so you'll want to set it
explicitly).
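
For example (a sketch; the partition counts are arbitrary and depend on
your data):

import org.apache.spark.SparkConf

// Be explicit about shuffle parallelism rather than relying on defaults
val conf = new SparkConf()
  .set("spark.sql.shuffle.partitions", "2000") // DataFrame joins/aggregations
  .set("spark.default.parallelism", "2000")    // RDD shuffles that don't specify a count

// or pass it per operation on RDDs:
//   left.join(right, numPartitions = 2000)
//   rdd.repartition(2000)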

-------
Regards,
Andy
