Posted to user@spark.apache.org by Abhijith Chandraprabhu <ab...@gmail.com> on 2016/05/03 07:02:51 UTC

Performance benchmarking of Spark Vs other languages

Hello,

I am trying to find some performance figures for Spark versus various other
languages for an ALS-based recommender system. I am using the 20-million-rating
MovieLens dataset. The test environment is one big 30-core machine with 132 GB
of memory. I am using the Scala version of the script provided here:
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
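
For reference, the core of what I am running is roughly the following
(the file path, the CSV parsing, and the parameter values are placeholders
rather than my exact settings; sc is the SparkContext, created as sketched
further down):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Load the MovieLens ratings; the ml-20m CSV layout (userId,movieId,rating,timestamp) is assumed here.
val ratings = sc.textFile("ml-20m/ratings.csv")
  .filter(!_.startsWith("userId"))                 // drop the header line
  .map(_.split(',') match { case Array(user, movie, rating, _) =>
    Rating(user.toInt, movie.toInt, rating.toDouble)
  })

// Train ALS with placeholder hyperparameters: rank, number of iterations, regularization lambda.
val model = ALS.train(ratings, 10, 10, 0.01)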

I am not an expert in Spark, and I assume that varying n when invoking Spark
with the flag --master local[n] is supposed to provide ideal scaling.
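
For concreteness, --master local[n] should be equivalent to building the
context like this inside the driver (the app name and the value of n here
are only examples, varied between runs):

import org.apache.spark.{SparkConf, SparkContext}

// n local worker threads; n is varied between runs to measure scaling.
val n = 16
val conf = new SparkConf()
  .setAppName("ALS-benchmark")
  .setMaster(s"local[$n]")
val sc = new SparkContext(conf)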

Initial observations did not favour Spark, though only by small margins, but
as I said, since I am not a Spark expert, I would comment only after being
assured that this is the optimal way of running the ALS snippet.

Could the experts please advise on the optimal way to get the best timings
out of Spark's ALS example in the environment described above? Thanks.

-- 
Best regards,
Abhijith

Re: Performance benchmarking of Spark Vs other languages

Posted by Jörn Franke <jo...@gmail.com>.
Hello,

Spark is a general framework for distributed in-memory processing. You can always write a highly specialized piece of code that is faster than Spark, but then it can do only one thing, and if you need something else you will have to rewrite everything from scratch. This is why Spark is beneficial.
In this context, your setup does not make sense. You should have at least five worker nodes to make a meaningful evaluation.
Follow the Spark tuning guide and its recommendations.
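
As a rough sketch of the kind of settings the tuning guide covers (the values
below are only illustrative, not a recommendation for your dataset or machine):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative settings only; the right values depend on the data and the hardware.
// The driver heap has to be set before the JVM starts, e.g. spark-submit --driver-memory 64g.
val conf = new SparkConf()
  .setAppName("ALS-tuning-sketch")
  .setMaster("local[30]")                          // one worker thread per core
  .set("spark.default.parallelism", "120")         // a few partitions per core is a common starting point
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// The number of ALS blocks also controls how the factorization itself is partitioned:
// ALS.train(ratings, rank, iterations, lambda, numBlocks)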
