Posted to user@spark.apache.org by kmurph <k....@qub.ac.uk> on 2016/05/04 14:21:22 UTC

Spark MLlib benchmarks

Hi, 

I'm benchmarking Spark (1.6) MLlib TF-IDF (with HDFS) on a 20GB dataset, and
I'm not seeing much scale-up when I increase cores/executors/RAM according to
the Spark tuning documentation.  I suspect I'm missing a trick in my
configuration.
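
For context, the job is the standard spark.mllib (RDD-based) TF-IDF pipeline,
roughly as below; the HDFS path and the whitespace tokenisation are simplified
placeholders, and sc is the usual SparkContext:

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Each document becomes a sequence of tokens (placeholder tokenisation)
    val docs: RDD[Seq[String]] = sc.textFile("hdfs:///path/to/20gb-corpus")
      .map(_.split(" ").toSeq)

    val tf: RDD[Vector] = new HashingTF().transform(docs)
    tf.cache()  // IDF.fit makes a second pass over the term frequencies
    val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)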

I'm running on a shared-memory machine (96 cores, 256GB RAM) and testing
various combinations of:
Number of executors (1, 2, 4, 8)
Number of cores per executor (1, 2, 4, 8, 12, 24)
Memory per executor (calculated as per the Cloudera recommendations)
All of course kept within the combined resource limits; one point in the sweep
is sketched below.
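
With example values (the 28g here is illustrative; the real figure comes out
of the per-executor calculation), one such combination is set like this:

    import org.apache.spark.{SparkConf, SparkContext}

    // Example point in the sweep: 4 executors x 8 cores each.
    // Executor memory follows the Cloudera-style calculation
    // (usable RAM minus overhead, divided across executors).
    val conf = new SparkConf()
      .setAppName("mllib-tfidf-benchmark")
      .set("spark.executor.instances", "4")
      .set("spark.executor.cores", "8")
      .set("spark.executor.memory", "28g")
    val sc = new SparkContext(conf)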

I'm also setting the number of RDD partitions to 2, 4, 6 or 8 (I see the best
results at 4 partitions, about 5% better than the worst case).
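
Continuing the sketch above, that just means fixing the partition count before
the TF stage:

    // Either at load time...
    val docs4 = sc.textFile("hdfs:///path/to/20gb-corpus", 4)
      .map(_.split(" ").toSeq)
    // ...or explicitly on an existing RDD
    val repartitioned = docs.repartition(4)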

I have also varied/switched the following settings (sketched with example
values below):
Using the Kryo serializer
Setting driver memory
Enabling compressed oops
Dynamic allocation
Trying different storage levels for persisting RDDs
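
Concretely, with example values and reusing tf from the sketch above (by
"compressed oops" I mean the JVM flag, and by "dynamic allocation" Spark's
spark.dynamicAllocation mechanism):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Driver memory normally has to be passed as spark-submit
      // --driver-memory; the driver JVM is already up when this is read.
      .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true") // required by dynamic allocation

    // One of the storage levels tried for the cached term-frequency RDD
    tf.persist(StorageLevel.MEMORY_ONLY_SER)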

Even in the best of these configurations, as we increase the number of cores
we still see a running time of 19-20 minutes.
Is there anything else I should be configuring to get better scale-up?
Are there any documented TF-IDF benchmark results that I could compare
against to validate this (even if only as very approximate, indirect
comparisons)?

Any advice would be much appreciated,
Thanks
Karen




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLLib-benchmarks-tp26878.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org