Posted to user@spark.apache.org by Colin Beckingham <co...@kingston.net> on 2016/07/27 20:31:51 UTC

Run times for Spark 1.6.2 compared to 2.1.0?

I have a project which runs fine in both Spark 1.6.2 and 2.1.0. It 
calculates a logistic model using MLlib. I compiled 2.1.0 from source 
today; the 1.6.2 install is a precompiled build with Hadoop. The odd 
thing is that 1.6.2 produces an answer in 350 sec while 2.1.0 takes 
990 sec, running identical pyspark code. I'm wondering if there is 
something in the setup params for 1.6 and 2.1, say the number of 
executors or memory allocation, which might account for this? I'm 
using just the 4 cores of my machine as master and executors.
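
For context, the run boils down to something like the following 
pyspark sketch; the app name, data path, and iteration count are 
placeholders, not my actual job:

from pyspark import SparkConf, SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.util import MLUtils
import time

# Local master on the 4 cores; note that in local mode memory has to
# be set on the command line (e.g. pyspark --driver-memory 4g) rather
# than in SparkConf after the JVM has started.
conf = SparkConf().setMaster("local[4]").setAppName("logistic-timing")
sc = SparkContext(conf=conf)

# Hypothetical input file in LIBSVM format (placeholder path).
data = MLUtils.loadLibSVMFile(sc, "data/train.libsvm")

start = time.time()
model = LogisticRegressionWithLBFGS.train(data, iterations=100)
print("train time: %.1f sec" % (time.time() - start))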



Re: Run times for Spark 1.6.2 compared to 2.1.0?

Posted by Colin Beckingham <co...@kingston.net>.
FWIW I have a bit more information. Watching the jobs as Spark runs, I 
can see that when performing the logistic regression in Spark 1.6.2, 
the pyspark call "LogisticRegressionWithLBFGS.train()" runs 
"treeAggregate at LBFGS.scala:218", but the same call in Spark 2.1 
runs "treeAggregate at LogisticRegression.scala:1092". Each of these 
2.1 calls takes about 3 times longer than the LBFGS version, there are 
far more of them, and the result is considerably less accurate than 
the LBFGS one. The rest of the process seems pretty close between the 
two. So Spark 2.1 does not seem to be running an optimized version of 
the logistic regression algorithm?
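
If anyone wants to dig into this, one check would be to call the 
ml-package LogisticRegression directly in 2.1 and match the stopping 
criteria used by the 1.6 LBFGS run. A rough sketch, assuming a LIBSVM 
input file; the path and parameter values are placeholders:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Hypothetical LIBSVM input loaded as a (label, features) DataFrame.
df = spark.read.format("libsvm").load("data/train.libsvm")

# Illustrative settings: the idea is to match whatever iteration cap
# and tolerance the 1.6 LBFGS run used, to see whether the extra
# treeAggregate passes come from different convergence settings
# rather than from the algorithm itself.
lr = LogisticRegression(maxIter=100, tol=1e-6, regParam=0.0)
model = lr.fit(df)
print("iterations run:", model.summary.totalIterations)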
