Posted to user@spark.apache.org by Yanbo Liang <yb...@gmail.com> on 2016/01/01 16:42:02 UTC

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

Hi Jia,

I think the example you provided is not well suited to illustrating what the
driver and the executors do, because it does not show the internal
implementation of the KMeans algorithm.
You can refer to the source code of MLlib KMeans (
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L227
).
In short, the driver needs O(number of centers) memory, while each executor
needs O(partition size) memory. Usually the dataset is large and distributed
across many executors, but the set of centers is small even compared with the
portion of the dataset held by a single executor.
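
As a rough illustration (a sketch with made-up sizes, not numbers from your
job), compare the two footprints:

    public class KMeansMemorySketch {
        public static void main(String[] args) {
            int k = 100;                    // number of cluster centers
            int d = 1000;                   // feature dimension
            long numPoints = 100_000_000L;  // total input vectors
            int numExecutors = 10;

            // Driver side: k dense centers of dimension d, 8 bytes per double.
            long driverBytes = (long) k * d * 8;                        // ~0.8 MB
            // Executor side: each caches its share of the input vectors.
            long perExecutorBytes = numPoints / numExecutors * d * 8L;  // ~80 GB
            System.out.println("driver ~" + driverBytes + " bytes, "
                + "per executor ~" + perExecutorBytes + " bytes");
        }
    }

So even when the cached dataset overwhelms the executors, the centers kept on
the driver stay tiny.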

Cheers
Yanbo

2015-12-31 22:31 GMT+08:00 Jia Zou <ja...@gmail.com>:

> Thanks, Yanbo.
> The results became much more reasonable after I set the driver memory to
> 5GB and increased the worker memory to 25GB.
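>
> (For reference, one way to pass such settings; this is a sketch, not
> necessarily my exact invocation. Note that spark.driver.memory must be set
> before the driver JVM starts, e.g. via spark-submit, because setting it on
> SparkConf inside main() is too late to grow the driver heap.)
>
>     SparkConf sparkConf = new SparkConf()
>         .setAppName("JavaKMeans")
>         .set("spark.executor.memory", "25g");  // executor (worker) heap
>     // Driver heap: pass --driver-memory 5g to spark-submit instead.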
>
> So, my question is: for the following code snippet, extracted from the main
> method of JavaKMeans.java in the examples, what will the driver do, and what
> will the workers do?
>
> I didn't understand this well even after reading
> https://spark.apache.org/docs/1.1.0/cluster-overview.html and
> http://stackoverflow.com/questions/27181737/how-to-deal-with-executor-memory-and-driver-memory-in-spark
>
>     SparkConf sparkConf = new SparkConf().setAppName("JavaKMeans");
>
>     JavaSparkContext sc = new JavaSparkContext(sparkConf);
>
>     JavaRDD<String> lines = sc.textFile(inputFile);
>
>     JavaRDD<Vector> points = lines.map(new ParsePoint());
>
>     points.persist(StorageLevel.MEMORY_AND_DISK());
>
>     KMeansModel model = KMeans.train(points.rdd(), k, iterations, runs,
> KMeans.K_MEANS_PARALLEL());
>
>
> Thank you very much!
>
> Best Regards,
> Jia
>
> On Wed, Dec 30, 2015 at 9:00 PM, Yanbo Liang <yb...@gmail.com> wrote:
>
>> Hi Jia,
>>
>> You can try inputRDD.persist(MEMORY_AND_DISK) and verify whether it
>> produces stable performance. The MEMORY_AND_DISK storage level stores the
>> partitions that don't fit in memory on disk and reads them from there when
>> they are needed.
>> Actually, it's not necessary to set such a large driver memory in your
>> case, because KMeans uses little driver memory as long as k is not very
>> large.
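>>
>> For example (just a sketch; inputRDD stands for your cached JavaRDD of
>> input vectors):
>>
>>     import org.apache.spark.storage.StorageLevel;
>>
>>     // Partitions that don't fit in memory are spilled to local disk and
>>     // re-read from there when needed, instead of being dropped and
>>     // recomputed from the source on later KMeans iterations.
>>     inputRDD.persist(StorageLevel.MEMORY_AND_DISK());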
>>
>> Cheers
>> Yanbo
>>
>> 2015-12-30 22:20 GMT+08:00 Jia Zou <ja...@gmail.com>:
>>
>>> I am running Spark MLlib KMeans on one EC2 M3.2xlarge instance with 8
>>> CPU cores and 30GB memory. Executor memory is set to 15GB, and driver
>>> memory is set to 15GB.
>>>
>>> The observation is that when the input data size is smaller than 15GB, the
>>> performance is quite stable. However, when the input data becomes larger
>>> than that, the performance becomes extremely unpredictable. For example,
>>> for a 15GB input with inputRDD.persist(MEMORY_ONLY), I got three
>>> dramatically different results: 27 mins, 61 mins, and 114 mins. (All
>>> settings are the same for the 3 tests, and I create the input data
>>> immediately before running each test to keep the OS buffer cache hot.)
>>>
>>> Can anyone help to explain this? Thanks very much!
>>>
>>>
>>
>