Posted to user@spark.apache.org by "Sereday, Scott" <Sc...@nielsen.com> on 2015/08/13 22:24:17 UTC

New Spark User - GBM iterations and Spark benchmarks

I am exploring Spark because my dataset is becoming more and more difficult to manage and analyze. I'd appreciate it if anyone could provide feedback on the following questions:


*         I am especially interested in training machine learning algorithms on large datasets.

o   Does PySpark have a Gradient Boosting Machine package that allows the user to run multiple iterations in the same command, similar to R's caret package? (A rough sketch of the workflow I mean follows below this list.)

*         Also, does anyone know of benchmarks that illustrate when Spark is most (and least) appropriate to use?

o   I've often heard "when your data is not manageable on one computer", but I'd appreciate more concrete comparisons if possible.

o   If anyone has benchmarks that consider data size, type of operation, etc. that would be extremely helpful.

-  At what point do the efficiency gains overtake the overhead, and when is Spark substantially faster (compared to R's caret/gbm, h2o, Python, etc.)?
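
For concreteness, here is a rough sketch of the caret-style workflow I am hoping exists in PySpark, based on my reading of the pyspark.ml docs (GBTClassifier plus ParamGridBuilder/CrossValidator). I have not verified this end to end; the DataFrame `df` with "label" and "features" columns and a running Spark session are assumed:

    # Sketch only: train a gradient-boosted tree classifier over a small
    # hyperparameter grid in one call, analogous to caret's tuneGrid.
    # Assumes an existing DataFrame `df` with "label" and "features" columns.
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

    gbt = GBTClassifier(labelCol="label", featuresCol="features")

    # Several tree depths and boosting iteration counts, tried in one command.
    grid = (ParamGridBuilder()
            .addGrid(gbt.maxDepth, [3, 5])
            .addGrid(gbt.maxIter, [50, 100])
            .build())

    evaluator = MulticlassClassificationEvaluator(metricName="f1")
    cv = CrossValidator(estimator=gbt, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)

    cv_model = cv.fit(df)           # fits one model per grid point per fold
    best_model = cv_model.bestModel # best parameter combination by f1

Is this roughly how it is meant to be done, or is there a more direct equivalent to caret's single train() call?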

Thanks so much