Posted to user@spark.apache.org by Aseem Bansal <as...@gmail.com> on 2017/04/07 11:12:14 UTC

Spark 2.1 ml library scalability

When using Spark ML's LogisticRegression, RandomForest, CrossValidator etc.,
do we need to take anything in particular into consideration while coding to
make them scale with more CPUs, or do they scale automatically?

I am reading some data from S3 and using a pipeline to train a model. I am
running the job on a Spark cluster with 36 cores and 60GB RAM and I cannot
see much usage. It is running, but I was expecting Spark to use all of the
available RAM and finish faster. So I was wondering whether we need to take
something in particular into consideration, or whether my expectations are
wrong.
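
By default Spark only uses the memory and cores that are requested for its
executors; it will not automatically take all of the cluster's RAM. Below is
a minimal sketch of how the resources might be requested, assuming the job is
launched with org.apache.spark.launcher.SparkLauncher (as mentioned later in
the thread); the jar path, main class and sizing are hypothetical:

    import org.apache.spark.launcher.SparkLauncher

    object LaunchTrainingJob {
      def main(args: Array[String]): Unit = {
        // Hypothetical sizing: 5 executors x 6 cores x 8g roughly fills a
        // 36-core / 60GB cluster while leaving room for the driver and
        // YARN overhead.
        val process = new SparkLauncher()
          .setAppResource("/path/to/training-job.jar") // hypothetical jar
          .setMainClass("com.example.TrainModel")      // hypothetical class
          .setMaster("yarn")
          .setDeployMode("cluster")
          .setConf("spark.executor.instances", "5")
          .setConf(SparkLauncher.EXECUTOR_CORES, "6")
          .setConf(SparkLauncher.EXECUTOR_MEMORY, "8g")
          .setConf(SparkLauncher.DRIVER_MEMORY, "4g")
          .launch()
        process.waitFor()
      }
    }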

Re: Spark 2.1 ml library scalability

Posted by Nick Pentreath <ni...@gmail.com>.
It's true that CrossValidator is not parallel currently - see
https://issues.apache.org/jira/browse/SPARK-19357 and feel free to help
review.
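
Until that lands, one rough workaround is to fit and score the candidate
models yourself and overlap the fits with Scala futures. A minimal sketch,
assuming a binary label and the default "label"/"features" columns (only the
regParam grid below comes from this thread):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.sql.DataFrame

    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, Future}

    // Fit one model per regParam candidate in its own future so the Spark
    // jobs can overlap, then keep the value with the best area under ROC.
    def bestRegParam(train: DataFrame, validation: DataFrame): Double = {
      val evaluator = new BinaryClassificationEvaluator() // areaUnderROC
      val candidates = Seq(0.0001, 0.001, 0.005, 0.01, 0.05, 0.1)

      val fits = candidates.map { reg =>
        Future {
          val model = new LogisticRegression().setRegParam(reg).fit(train)
          (reg, evaluator.evaluate(model.transform(validation)))
        }
      }

      Await.result(Future.sequence(fits), Duration.Inf).maxBy(_._2)._1
    }

The fits still share one SparkContext, so this mainly helps when a single fit
leaves the cluster underutilized; the JIRA above tracks doing this inside
CrossValidator itself.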


Re: Spark 2.1 ml library scalability

Posted by Aseem Bansal <as...@gmail.com>.
   - Limited the data to 100,000 records.
   - 6 categorical features which go through imputation, string indexing and
   one-hot encoding. The maximum number of classes for a feature is 100. As
   the data is imputed it becomes dense.
   - 1 numerical feature.
   - Training Logistic Regression through CrossValidator with a grid to
   optimize its regularization parameter over the values 0.0001, 0.001, 0.005,
   0.01, 0.05, 0.1.
   - Using Spark's launcher API to launch it on a YARN cluster in Amazon
   AWS.

I was thinking that as CrossValidator is finding the best parameters, it
should be able to evaluate them independently. That sounds like something
which could be run in parallel.
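
For reference, a rough sketch of the kind of pipeline described above (Spark
2.1 API; the column names, label column, fold count and up-front imputation
are assumptions, only the regularization grid comes from this thread):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
    import org.apache.spark.ml.{Pipeline, PipelineStage}

    // Hypothetical column names; the real ones are not in this thread.
    val categoricalCols = Array("cat1", "cat2", "cat3", "cat4", "cat5", "cat6")
    val numericCol = "num1"

    // Index and one-hot encode each categorical column. Missing categorical
    // values are assumed to be filled beforehand, e.g. with
    // df.na.fill("missing", categoricalCols), since these transformers do
    // not impute on their own.
    val indexers = categoricalCols.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
    }
    val encoders = categoricalCols.map { c =>
      new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_vec")
    }
    val assembler = new VectorAssembler()
      .setInputCols(categoricalCols.map(c => s"${c}_vec") :+ numericCol)
      .setOutputCol("features")

    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    // Regularization values taken from the grid described above.
    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.0001, 0.001, 0.005, 0.01, 0.05, 0.1))
      .build()

    val stages: Seq[PipelineStage] =
      indexers.toSeq ++ encoders ++ Seq(assembler, lr)

    val cv = new CrossValidator()
      .setEstimator(new Pipeline().setStages(stages.toArray))
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(grid)
      .setNumFolds(3) // fold count assumed, not stated here

    // val cvModel = cv.fit(trainingData)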



Re: Spark 2.1 ml library scalability

Posted by Nick Pentreath <ni...@gmail.com>.
What is the size of the training data (number of examples, number of
features)? Dense or sparse features? How many classes?

What commands are you using to submit your job via spark-submit?
