Posted to user@spark.apache.org by Koert Kuipers <ko...@tresata.com> on 2016/11/23 18:53:11 UTC

spark sql jobs heap memory

We are testing Dataset/DataFrame jobs instead of RDD jobs. One thing we
keep running into is containers getting killed by YARN. I realize this has
to do with off-heap memory, and the suggestion is to increase
spark.yarn.executor.memoryOverhead.

At times our memoryOverhead is as large as the executor memory (say 4G and
4G).

Why is Dataset/DataFrame using so much off-heap memory?

We haven't changed spark.memory.offHeap.enabled, which defaults to false.
Should we enable that to get a better handle on this?
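
For reference, a minimal sketch of how the 4G/4G split mentioned above would
be configured (values are illustrative; these settings are normally passed via
spark-submit --conf, shown here as a SparkConf for clarity):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")                 // executor heap
      .set("spark.yarn.executor.memoryOverhead", "4096")  // off-heap overhead, in MB
      // YARN container requested per executor = 4g heap + 4096m overhead = 8g

    val spark = SparkSession.builder().config(conf).getOrCreate()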

Re: spark sql jobs heap memory

Posted by Rohit Karlupia <ro...@qubole.com>.
Dataset/DataFrame will use direct/raw/off-heap memory in an efficient
columnar fashion. Trying to fit the same amount of data in heap memory would
likely increase your memory requirement and decrease the speed.

So, in short, don't worry about it and increase the overhead. You can also
put a bound on the off-heap memory Spark itself manages via
spark.memory.offHeap.size (used together with spark.memory.offHeap.enabled).
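
For example, a minimal sketch (the 2g size is an illustrative assumption,
tune it for your workload; on YARN, memoryOverhead still needs to be large
enough to cover this allocation):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", "2g")  // hard cap on Spark-managed off-heap allocations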

thanks,
rohitk
