Posted to user@spark.apache.org by "Ulanov, Alexander" <al...@hpe.com> on 2016/01/04 22:06:18 UTC

RE: SparkML algos limitations question.

Hi Yanbo,

As long as two copies of the model fit into the memory of a single machine, there should be no problems, so even 16GB machines can handle large models. (The master should have more memory, because it runs L-BFGS.) In my experiments, I've trained models with 12M and 32M parameters without issues.
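For concreteness, a minimal sketch of what such a training run might look like with the Spark ML pipeline API of that era; the layer sizes, block size, and the `train` DataFrame are illustrative assumptions, not the exact setup from the experiments above:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// Illustrative topology 784 -> 2048 -> 2048 -> 10: counting biases,
// (784+1)*2048 + (2048+1)*2048 + (2048+1)*10 is roughly 5.8M weights.
val layers = Array[Int](784, 2048, 2048, 10)

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128) // stacks rows into matrices for faster native BLAS
  .setMaxIter(100)

// `train` is an assumed DataFrame with "features" and "label" columns.
val model = trainer.fit(train)

Since L-BFGS keeps its history on the driver, a run like this would also be submitted with a generous driver heap, e.g. spark-submit --driver-memory 16g.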

Best regards, Alexander

From: Yanbo Liang [mailto:ybliang8@gmail.com]
Sent: Sunday, December 27, 2015 2:23 AM
To: Joseph Bradley
Cc: Eugene Morozov; user; dev@spark.apache.org
Subject: Re: SparkML algos limitations question.

Hi Eugene,

AFAIK, the current implementation of MultilayerPerceptronClassifier has some scalability problems if the model is very large (such as more than 10M parameters), although I think the current limits already cover many use cases.
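To estimate whether a given topology crosses that mark before training, you can count the parameters by hand: each pair of adjacent layers of sizes n_in and n_out contributes (n_in + 1) * n_out weights, counting biases. A back-of-the-envelope helper (not part of the Spark API):

// Hypothetical helper: estimate MLP parameter count for a layer spec.
def mlpParamCount(layers: Array[Int]): Long =
  layers.sliding(2).map { case Array(in, out) => (in + 1L) * out }.sum

mlpParamCount(Array(784, 2048, 2048, 10))   // ~5.8M: under the mark
mlpParamCount(Array(4096, 4096, 4096, 100)) // ~34M: well above it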

Yanbo

2015-12-16 6:00 GMT+08:00 Joseph Bradley <jo...@databricks.com>:
Hi Eugene,

The maxDepth parameter exists because the implementation uses Integer node IDs which correspond to positions in the binary tree.  This simplified the implementation.  I'd like to eventually modify it to avoid depending on tree node IDs, but that is not yet on the roadmap.
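To make the constraint concrete: with the usual 1-based binary-heap indexing, a node with ID i has children 2*i and 2*i + 1, so a tree of depth d uses IDs up to 2^(d+1) - 1. A signed 32-bit Int tops out at 2^31 - 1, which is why the current implementation caps maxDepth at 30. The arithmetic, as a sketch:

// The largest node ID in a tree of depth d (root at depth 0, ID 1)
// is 2^(d+1) - 1; Int.MaxValue is 2^31 - 1.
def maxNodeId(depth: Int): Long = (1L << (depth + 1)) - 1

maxNodeId(30) // 2147483647 == Int.MaxValue: still fits
maxNodeId(31) // 4294967295: overflows a 32-bit signed Int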

There is not an analogous limit for the GLMs you listed, but I'm not very familiar with the perceptron implementation.

Joseph

On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov <ev...@gmail.com> wrote:
Hello!

I'm currently working on a POC and trying to use Random Forest (classification and regression). I also have to check SVM and the multiclass perceptron (other algorithms are less important at the moment). So far I've discovered that Random Forest has a maxDepth limitation for its trees, and just out of curiosity I wonder why such a limitation was introduced?
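For reference, the parameter in question looks like this in the ML pipeline API; the values are illustrative placeholders, not recommendations:

import org.apache.spark.ml.classification.RandomForestClassifier

// maxDepth is the tree-depth limit being asked about; 10 is arbitrary.
val rf = new RandomForestClassifier()
  .setNumTrees(100)
  .setMaxDepth(10)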

The actual question: I'm going to use Spark ML in production next year and would like to know whether there are other limitations, like maxDepth in RF, for other algorithms: Logistic Regression, Perceptron, SVM, etc.

Thanks in advance for your time.
--
Be well!
Jean Morozov



Re: SparkML algos limitations question.

Posted by Yanbo Liang <yb...@gmail.com>.
Hi Alexander,

That's cool! Thanks for the clarification.

Yanbo

