Posted to user@spark.apache.org by Hu You <hy...@outlook.com> on 2022/01/18 06:37:17 UTC

[ML Intermediate]: Slow fitting of Linear regression vs Sklearn

Hi team,

We are evaluating the performance and capability of Spark's linear regression, with the goal of replacing at least sklearn's linear regression.
We first generated data for model fitting via sklearn.datasets.make_regression. See the generation code here: <https://gist.github.com/yh882317/c177a8593e3bfd30e4da63a8fcb95593#file-datageneration-py>

Then we fit the model with both sklearn (Python) <https://gist.github.com/yh882317/c177a8593e3bfd30e4da63a8fcb95593#file-linearregression-py> and Spark (Scala) <https://gist.github.com/yh882317/c177a8593e3bfd30e4da63a8fcb95593#file-linearregression-scala>. We set up both models following this Stack Overflow post <https://stackoverflow.com/questions/42729431/spark-ml-regressions-do-not-calculate-same-models-as-scikit-learn> so that the two sides should produce the same model.
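For reference, the Spark side looks roughly like the sketch below. This is a minimal reconstruction rather than the exact gist; the CSV path, the "label" column name and the variable names are placeholders:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.regression.LinearRegression

    // Load the CSV written by make_regression (path and schema are assumptions).
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/regression_data.csv")

    // Assemble every column except the label into a single feature vector.
    val featureCols = raw.columns.filter(_ != "label")
    val assembled = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
      .transform(raw)

    // Plain OLS, configured to mirror sklearn's LinearRegression
    // (no regularization, intercept fitted), per the Stack Overflow post above.
    val lr = new LinearRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setRegParam(0.0)
      .setElasticNetParam(0.0)
      .setFitIntercept(true)
      .setStandardization(false)

    val model = lr.fit(assembled)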
The Spark setup is:

- Spark 3.1
- spark-shell -I
- Spark Standalone with 3 executors (8 cores, 128 GB each)
- Driver: spark.driver.maxResultSize=20g; spark.driver.memory=100g; spark.executor.memory=128g
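As a sanity check that these settings actually take effect inside spark-shell, something like the following can be run first (a minimal sketch, not from the gists):

    // Effective configuration as seen by the running application.
    println(spark.conf.get("spark.driver.memory"))
    println(spark.conf.get("spark.driver.maxResultSize"))
    println(spark.conf.get("spark.executor.memory"))
    // Total cores granted by the standalone master (3 executors x 8 cores = 24 here).
    println(spark.sparkContext.defaultParallelism)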

Fitting time:
Python: 19 s
Spark: 600+ s

Resource utilization:
Python: 150 GB
Spark: Node 1: 50+ GB; Nodes 2, 3 and the driver node: ~2 GB

The job gets stuck on this stage:
MapPartitionsRDD [23]
treeAggregate at WeightedLeastSquares.scala:107
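
Given that almost all of the memory usage sits on Node 1, one quick way to see how the training data is spread across partitions during this stage (a minimal sketch, using the hypothetical `assembled` DataFrame from the sketch above):

    import org.apache.spark.sql.functions.spark_partition_id

    // How many partitions does the training data have, and how evenly is it spread?
    println(assembled.rdd.getNumPartitions)
    assembled.groupBy(spark_partition_id()).count().orderBy("count").show(false)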

What we tried (a sketch of these tweaks follows below):

- maxTreeDepth
- repartition
- standardization

None of them had a significant effect on training speed.
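
These attempts look roughly like the following sketch (it reuses the hypothetical names from the earlier sketch; setAggregationDepth is my reading of "maxTreeDepth", since that is the parameter controlling the treeAggregate depth, and 24 = 3 executors x 8 cores):

    import org.apache.spark.ml.regression.LinearRegression

    // Spread the training data across all executor cores before fitting.
    val repartitioned = assembled.repartition(24).cache()
    repartitioned.count()   // materialize the cache

    val lr2 = new LinearRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setRegParam(0.0)
      .setFitIntercept(true)
      .setStandardization(true)    // also tried with standardization off
      .setAggregationDepth(3)      // deeper treeAggregate (default depth is 2)

    val model2 = lr2.fit(repartitioned)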

Could you please help us figure out where the issue comes from? A 20x slowdown is not acceptable for us, even though Spark has other good features.

Thanks!
You Hu