Posted to user@spark.apache.org by 陈哲 <cz...@gmail.com> on 2016/10/12 07:24:28 UTC

Spark ML OOM problem

Hi
    I'm using Spark ML to train a RandomForest model. There are about
200,000 lines in the training data file and about 100 features. I'm running
Spark in local mode with JAVA_OPTS like: -Xms1024m -Xmx10296m
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps, but OOM errors keep coming up.
I tried changing the Spark configuration to avoid this, but it didn't help.
My Spark conf:

spark.memory.fraction=0.85
spark.executor.instances=16
spark.executor.heartbeatInterval=120
spark.driver.maxResultSize=0
spark.ui.retainedJobs=0, spark.ui.retainedStages=0, ...  // I set all the similar retention settings to 0
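
For reference, here is a minimal sketch of how these settings would be applied
programmatically (assuming Spark 2.x; the exact version was not stated). One
caveat: in local mode the driver and executors share a single JVM, so
spark.executor.instances has no effect there, and the usable heap is simply
what -Xmx (or spark-submit --driver-memory) gives the process.

import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming Spark 2.x. In local mode everything runs in one
// JVM, so spark.executor.instances is ignored and the heap size comes from
// -Xmx / --driver-memory, not from any spark.executor.* setting.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rf-training")                     // placeholder app name
  .config("spark.memory.fraction", "0.85")    // share of heap for execution + storage
  .config("spark.driver.maxResultSize", "0")  // 0 = unlimited
  .getOrCreate()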

Here is part of the GC log:
4876.028: [Full GC [PSYoungGen: 439296K->417358K(878592K)] [ParOldGen:
9139497K->9139487K(9225216K)] 9578793K->9556845K(10103808K) [PSPermGen:
81436K->81436K(81920K)], 2.0203540 secs] [Times: user=49.85 sys=0.12,
real=2.02 secs]
4878.100: [Full GC [PSYoungGen: 417930K->187983K(878592K)] [ParOldGen:
9139487K->9166111K(9225216K)] 9557418K->9354094K(10103808K) [PSPermGen:
81436K->81436K(81920K)], 4.2368530 secs] [Times: user=53.77 sys=0.09,
real=4.23 secs]
4882.414: [Full GC [PSYoungGen: 428018K->202158K(878592K)] [ParOldGen:
9211167K->9196569K(9225216K)] 9639185K->9398727K(10103808K) [PSPermGen:
81436K->81436K(81920K)], 4.4419950 secs] [Times: user=54.75 sys=0.08,
real=4.44 secs]
4886.886: [Full GC [PSYoungGen: 425657K->397128K(878592K)] [ParOldGen:
9196569K->9196568K(9225216K)] 9622227K->9593697K(10103808K) [PSPermGen:
81436K->81436K(81920K)], 2.3522140 secs] [Times: user=51.41 sys=0.09,
real=2.35 secs]
4889.239: [Full GC [PSYoungGen: 397128K->397128K(878592K)] [ParOldGen:
9196568K->9196443K(9225216K)] 9593697K->9593572K(10103808K) [PSPermGen:
81436K->81289K(81408K)], 30.5637160 secs] [Times: user=767.29 sys=2.98,
real=30.57 secs]

The Full GC failed to reclaim enough memory, hence the OOM.

I have two questions:
1. Why does the Spark log always show free memory, like:
4834409 [dispatcher-event-loop-5] INFO
org.apache.spark.storage.BlockManagerInfo - Removed broadcast_266_piece0 on
172.17.1.235:9948 in memory (size: 642.5 KB, free: 7.1 GB)
Is this wrong?

2. How can I avoid the OOM here? Do I have to increase -Xmx to a larger
value? How does Spark use this memory, and what is stored in it? Can anyone
point me to some docs?
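
As a back-of-envelope sketch of how the posted heap is divided under the
unified memory manager (Spark 1.6+); the numbers are approximate:

// Approximate heap split under the unified memory manager (Spark 1.6+),
// using the posted -Xmx10296m and spark.memory.fraction=0.85.
val heapMb     = 10296L
val reservedMb = 300L                                   // fixed system reserve
val unifiedMb  = ((heapMb - reservedMb) * 0.85).toLong  // execution + storage pool
val userMb     = heapMb - reservedMb - unifiedMb        // user code / data structures
println(s"unified ≈ $unifiedMb MB, user ≈ $userMb MB")
// prints: unified ≈ 8496 MB, user ≈ 1500 MB

With the fraction raised to 0.85, only about 1.5 GB is left for user-side
objects (e.g., per-partition data built during training), which can itself
be a source of OOM.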


Thanks

Patrick

Re: Spark ML OOM problem

Posted by Jörn Franke <jo...@gmail.com>.
Which Spark version?
Are you using RDDs or Datasets?
What type are the features? If strings, how large?
Is it Spark standalone?
How do you train/configure the algorithm? How do you initially parse the data?
The standard driver and executor logs could be helpful.
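
Since the training code itself was not posted, here is a hypothetical sketch
of a typical Spark ML RandomForest setup (assumes Spark 2.x; the column names
and file path are placeholders), mainly to show the knobs that tend to matter
for memory: maxDepth, maxBins, cacheNodeIds, and checkpointInterval.

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// Hypothetical sketch only -- the original poster's code was not shared.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Placeholder path and schema: 100 numeric feature columns plus a label.
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("training.csv")

val assembler = new VectorAssembler()
  .setInputCols((0 until 100).map(i => s"f$i").toArray)
  .setOutputCol("features")
val data = assembler.transform(raw)

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)           // placeholder value
  .setMaxDepth(10)            // deeper trees keep more node statistics in memory
  .setMaxBins(32)             // larger values multiply per-feature histogram size
  .setCacheNodeIds(true)      // caches node IDs per instance; trades memory for speed
  .setCheckpointInterval(10)  // requires spark.sparkContext.setCheckpointDir(...)

val model = rf.fit(data)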

