Posted to user@spark.apache.org by babloo80 <ba...@gmail.com> on 2016/01/06 04:44:11 UTC

Out of memory issue

Hello there,

I have a Spark job that reads 7 Parquet files (8 GB, 3 x 16 GB, 3 x 14 GB) in
different stages of execution and creates a 9 GB result Parquet file (about 27
million rows with 165 columns; some columns are map-based, holding histograms
of at most 200 values). The stages involve:
Step 1: Read the data using the DataFrame API.
Step 2: Transform the DataFrame to an RDD and perform some transformations
(some of the columns are turned into histograms, using an empirical
distribution to cap the number of keys, and some of them run like a UDAF
during the reduce-by-key step).
Step 3: Reduce the result by key so that it can be used in the next stage for
a join.
Step 4: Perform a left outer join of this result with another dataset built by
similar Steps 1 through 3.
Step 5: Further reduce the results and write them to Parquet.
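The histogram-capping idea in Step 2 can be illustrated outside Spark. This is a hypothetical plain-Python sketch of building per-key histograms capped at 200 distinct values, not the actual job code; the drop-least-frequent rule is one simple way to emulate capping by an empirical distribution:

```python
from collections import Counter

CAP = 200  # cap on distinct histogram values, as in the job description

def merge_histograms(a, b, cap=CAP):
    """Combine two value histograms, keeping at most `cap` keys.

    When the merged histogram exceeds the cap, the least frequent
    values are dropped."""
    merged = Counter(a)
    merged.update(b)
    if len(merged) > cap:
        merged = Counter(dict(merged.most_common(cap)))
    return merged

def reduce_by_key(records, cap=CAP):
    """records: iterable of (key, value) pairs. Returns {key: histogram}."""
    out = {}
    for key, value in records:
        out.setdefault(key, Counter())[value] += 1
    # apply the cap to each key's histogram
    return {k: merge_histograms(h, Counter(), cap) for k, h in out.items()}

records = [("row1", "a"), ("row1", "b"), ("row1", "a"), ("row2", "c")]
hists = reduce_by_key(records)
print(hists["row1"]["a"])  # 2
```

In Spark this pattern would map onto reduceByKey with merge_histograms as the combine function.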

With Apache Spark 1.5.2, I am able to run the job with no issues.
The current environment uses 8 nodes running a total of 320 cores, with 100 GB
of executor memory per node and the driver program using 32 GB. The
approximate execution time is about 1.2 hours. The Parquet files are stored in
another HDFS cluster for reading and the eventual write of the result.

When the same job is executed using Apache Spark 1.6.0, some of the executor
nodes' JVMs get restarted (with a new executor ID). After turning on GC stats
on the executors, perm-gen appears to get maxed out, ending up with the
symptoms of an out-of-memory condition.
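(For reference, on a Java 7 JVM the perm-gen ceiling can be raised and GC activity logged through the executor JVM options; the sizes below are hypothetical examples, not values from this job:

```
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  ...
```

)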

Please advise on where to start investigating this issue.

Thanks,
Muthu



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Out-of-memory-issue-tp25888.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Out of memory issue

Posted by Muthu Jayakumar <ba...@gmail.com>.
Thanks, Ewan Leith. This seems like a good start, as it seems to match up with
the symptoms I am seeing :).

But how do I specify "parquet.memory.pool.ratio"?
The Parquet code seems to take this parameter in
ParquetOutputFormat.getRecordWriter()
(ref code: float maxLoad = conf.getFloat(ParquetOutputFormat.MEMORY_POOL_RATIO,
MemoryManager.DEFAULT_MEMORY_POOL_RATIO);).
I wonder how this is provided through Apache Spark. 'TaskAttemptContext' seems
to be the hint for providing it, but I am not able to find a way to supply
this configuration.
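[One route worth checking, an assumption on my part rather than something confirmed in this thread: Spark copies any `spark.hadoop.*` property into the Hadoop Configuration it hands to output formats, so the Parquet setting can be supplied at submit time alongside spark.memory.fraction. The 0.4/0.3 values are the release-note examples; the class and jar names are placeholders:

```
spark-submit \
  --conf spark.memory.fraction=0.4 \
  --conf spark.hadoop.parquet.memory.pool.ratio=0.3 \
  --class com.example.MyJob my-job.jar
```

]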

Please advise,
Muthu


RE: Out of memory issue

Posted by Ewan Leith <ew...@realitymine.com>.
Hi Muthu, this could be related to a known issue in the release notes:

http://spark.apache.org/releases/spark-release-1-6-0.html

Known issues

    SPARK-12546 -  Save DataFrame/table as Parquet with dynamic partitions may cause OOM; this can be worked around by decreasing the memory used by both Spark and Parquet using spark.memory.fraction (for example, 0.4) and parquet.memory.pool.ratio (for example, 0.3, in Hadoop configuration, e.g. setting it in core-site.xml).

It's definitely worth setting spark.memory.fraction and parquet.memory.pool.ratio and trying again.
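As a concrete sketch of the core-site.xml route (the values are the release-note examples; the file lives in the Hadoop configuration directory picked up by Spark on each node):

```
<property>
  <name>parquet.memory.pool.ratio</name>
  <value>0.3</value>
</property>
```

spark.memory.fraction itself is a Spark property, so it would go in spark-defaults.conf (e.g. `spark.memory.fraction 0.4`) or be passed via --conf rather than into the Hadoop configuration.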

Ewan

