Posted to user@spark.apache.org by "Sereday, Scott" <Sc...@nielsen.com> on 2015/08/13 22:24:17 UTC
New Spark User - GBM iterations and Spark benchmarks
I am exploring Spark, as my dataset is becoming more and more difficult to manage and analyze. I'd appreciate it if anyone could provide feedback on the following questions:
* I am especially interested in training machine learning models on large datasets.
  o Does PySpark have a Gradient Boosting Machine package that allows the user to run multiple iterations in the same command, similar to R's caret package?
* Also, does anyone know of benchmarks that illustrate when Spark is most (and least) appropriate to use?
  o I've often heard "when your data is not manageable on one computer", but I'd appreciate more concrete comparisons if possible.
  o If anyone has benchmarks that consider data size, type of operation, etc., that would be extremely helpful.
    + At what point does the efficiency overtake the overhead, and when is it substantially faster (compared to R's caret/gbm, h2o, Python, etc.)?
Thanks so much