Posted to user@spark.apache.org by "Carlo.Allocca" <ca...@open.ac.uk> on 2016/11/03 10:35:33 UTC

LinearRegressionWithSGD and Rank Features By Importance

Hi All,

I am using Spark, and in particular the MLlib library.

import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;

For my problem I am using LinearRegressionWithSGD, and I would like to rank the features by importance.

I checked the documentation, and it seems that it does not provide such a method.

Am I missing anything?  Please, could you provide any help on this?
Should I change the approach?

Many Thanks in advance,

Best Regards,
Carlo


-- The Open University is incorporated by Royal Charter (RC 000391), an exempt charity in England & Wales and a charity registered in Scotland (SC 038302). The Open University is authorised and regulated by the Financial Conduct Authority.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: LinearRegressionWithSGD and Rank Features By Importance

Posted by "Carlo.Allocca" <ca...@open.ac.uk>.

Hi Robin,

On 4 Nov 2016, at 09:19, Robin East <ro...@xense.co.uk> wrote:

> Hi
>
> Do you mean the test of significance that you usually get with R output?

Yes, exactly.

> I don’t think there is anything implemented in the standard MLlib libraries; however, I believe that the SparkR version provides that. See http://spark.apache.org/docs/1.6.2/sparkr.html#gaussian-glm-model

Glad to hear that, as it means that I’m not missing much.

Many Thanks.

Best Regards,
Carlo

-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 4 Nov 2016, at 07:38, Carlo.Allocca <ca...@open.ac.uk> wrote:

Hi Mohit,

Thank you for your reply.
OK, so it means that coefficients with a high score are more important than others with a low score…

Many Thanks,
Best Regards,
Carlo


On 3 Nov 2016, at 20:41, Mohit Jaggi <mo...@gmail.com> wrote:

For linear regression, it should be fairly easy. Just sort the co-efficients :)
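
To make the "sort the coefficients" suggestion concrete, here is a minimal sketch in plain Java. It assumes the weights have already been copied out of the trained model into a double[] (with MLlib, for example, via model.weights().toArray()). It sorts by absolute value, since a large negative weight matters as much as a large positive one. The class and method names are hypothetical, and the ranking is only meaningful if the features were on comparable scales when the model was trained.

```java
import java.util.Arrays;
import java.util.Comparator;

final class CoefficientRanking {

    // Returns feature indices ordered from largest to smallest absolute weight.
    static int[] rankByMagnitude(double[] weights) {
        Integer[] order = new Integer[weights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort descending by |weight|; the sign only gives the direction of the effect.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -Math.abs(weights[i])));
        int[] result = new int[order.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = order[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical weights; with MLlib they could come from model.weights().toArray().
        double[] weights = {0.5, -2.0, 0.1};
        System.out.println(Arrays.toString(rankByMagnitude(weights))); // prints [1, 0, 2]
    }
}
```

The returned indices refer to positions in the feature vector, so they still have to be mapped back to feature names by whatever ordering was used to build the LabeledPoint instances.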

Mohit Jaggi
Founder,
Data Orchard LLC
www.dataorchardllc.com





Re: LinearRegressionWithSGD and Rank Features By Importance

Posted by Robin East <ro...@xense.co.uk>.

Hi 

Do you mean the test of significance that you usually get with R output? I don’t think there is anything implemented in the standard MLlib libraries; however, I believe that the SparkR version provides that. See http://spark.apache.org/docs/1.6.2/sparkr.html#gaussian-glm-model

-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action






Re: LinearRegressionWithSGD and Rank Features By Importance

Posted by "Carlo.Allocca" <ca...@open.ac.uk>.

Hi Masood,

Thanks for the answer.
Sure. I will do as suggested.

Many Thanks,
Best Regards,
Carlo

On 8 Nov 2016, at 17:19, Masood Krohy <ma...@intact.net> wrote:

> No, you do not scale back the predicted value. The output values (labels)
> were never scaled; only input features were scaled.

Re: LinearRegressionWithSGD and Rank Features By Importance

Posted by Masood Krohy <ma...@intact.net>.

No, you do not scale back the predicted value. The output values (labels) 
were never scaled; only input features were scaled.

For prediction on new samples, you scale the new sample first using the 
avg/std that you calculated for each feature when you trained your model, 
then feed it to the trained model. If it's a classification problem, then 
you're done here, a class is predicted based on the trained model. If it's 
a regression problem, then the predicted value does not need scaling back; 
it is in the same scale as your original output values you used when you 
trained your model.
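
A minimal sketch of the prediction step described above, in plain Java with no Spark dependency. The names (ScaledPrediction, trainMean, trainStd) are hypothetical. The weights and intercept are assumed to come from a model trained on standardized features and unscaled labels, and the mean and std are assumed to be the per-feature values computed on the training set.

```java
final class ScaledPrediction {

    // Scales a raw sample with the training-time mean/std, then applies the linear model.
    static double predict(double[] raw, double[] trainMean, double[] trainStd,
                          double[] weights, double intercept) {
        double y = intercept;
        for (int j = 0; j < raw.length; j++) {
            // Same transformation that was applied to the training features.
            double z = (raw[j] - trainMean[j]) / trainStd[j];
            y += weights[j] * z;
        }
        // The labels were never scaled, so y is already in their original units.
        return y;
    }

    public static void main(String[] args) {
        double[] raw = {3.0, 30.0};
        double[] trainMean = {2.0, 20.0};
        double[] trainStd = {1.0, 10.0};
        double[] weights = {4.0, -1.0};
        System.out.println(predict(raw, trainMean, trainStd, weights, 100.0)); // prints 103.0
    }
}
```

Note that there is no "(predicted-value * std) + avg" step: that inverse would only be needed if the labels themselves had been standardized before training.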

This is now becoming more of a data science/ML question than a Spark 
issue, and is probably best kept off this list. Do some reading on the 
topic and get back to me directly; I'll respond when possible.

Hope this has helped.

Masood

------------------------------
Masood Krohy, Ph.D. 
Data Scientist, Intact Lab-R&D 
Intact Financial Corporation 
http://ca.linkedin.com/in/masoodkh 

From:    Carlo.Allocca <ca...@open.ac.uk>
To:      Masood Krohy <ma...@intact.net>
Cc:      Carlo.Allocca <ca...@open.ac.uk>, Mohit Jaggi 
<mo...@gmail.com>, "user@spark.apache.org" <us...@spark.apache.org>
Date:    2016-11-08 11:02
Subject: Re: LinearRegressionWithSGD and Rank Features By Importance

Hi Masood, 

Thank you again for your suggestion. 
I have got a question about the following: 

For prediction on new samples, you need to scale each sample first before 
making predictions using your trained model. 

When applying the ML linear model as suggested above, it means that the 
predicted value is scaled. My question: does it need to be scaled back? I 
mean, should I apply the inverse of "calculate the average and std for each 
feature, deduct the avg, then divide by std" to the predicted value?
In practice, (predicted-value * std) + avg? 

Is that correct? Am I missing anything?

Many Thanks in advance. 
Best Regards,
Carlo


Re: LinearRegressionWithSGD and Rank Features By Importance

Posted by "Carlo.Allocca" <ca...@open.ac.uk>.

Hi Masood,

Thank you again for your suggestion.
I have got a question about the following:

> For prediction on new samples, you need to scale each sample first before making predictions using your trained model.

When applying the ML linear model as suggested above, it means that the predicted value is scaled. My question: does it need to be scaled back? I mean, should I apply the inverse of "calculate the average and std for each feature, deduct the avg, then divide by std" to the predicted value?
In practice, (predicted-value * std) + avg?

Is that correct? Am I missing anything?

Many Thanks in advance.
Best Regards,
Carlo


Re: LinearRegressionWithSGD and Rank Features By Importance

Posted by "Carlo.Allocca" <ca...@open.ac.uk>.

I found this just by googling: http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

Thanks.
Carlo

Re: LinearRegressionWithSGD and Rank Features By Importance

Posted by "Carlo.Allocca" <ca...@open.ac.uk>.

Hi Masood,

Thank you very much for your insight.
I am going to scale all my features as you described.

As I am a beginner, is there any paper or book that explains the suggested approaches? I would love to read it.

Many Thanks,
Best Regards,
Carlo


Re: LinearRegressionWithSGD and Rank Features By Importance

Posted by Masood Krohy <ma...@intact.net>.

Yes, you would want to scale those features before feeding them into any 
algorithm. One typical way would be to calculate the average and std for 
each feature, deduct the avg, then divide by std. Dividing by "max - min" 
is also a good option if you're sure there is no outlier shooting up your 
max or lowering your min significantly for each feature. After you have 
scaled each feature, then you can feed the data into the algo for 
training. 
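
The "deduct the avg, then divide by std" recipe can be sketched in plain Java as follows. This is an illustration only, not Spark's StandardScaler: the class name is hypothetical, it uses the population standard deviation (dividing by n), and it maps constant features to 0 to avoid dividing by zero.

```java
final class Standardizer {
    final double[] mean;
    final double[] std;

    private Standardizer(double[] mean, double[] std) {
        this.mean = mean;
        this.std = std;
    }

    // Computes per-feature mean and population std from the training rows
    // (assumes at least one row, and that all rows have the same length).
    static Standardizer fit(double[][] rows) {
        int d = rows[0].length;
        double[] mean = new double[d];
        double[] std = new double[d];
        for (double[] row : rows) {
            for (int j = 0; j < d; j++) {
                mean[j] += row[j] / rows.length;
            }
        }
        for (double[] row : rows) {
            for (int j = 0; j < d; j++) {
                double diff = row[j] - mean[j];
                std[j] += diff * diff / rows.length;
            }
        }
        for (int j = 0; j < d; j++) {
            std[j] = Math.sqrt(std[j]);
        }
        return new Standardizer(mean, std);
    }

    // Applies (x - mean) / std per feature; constant features become 0.
    double[] transform(double[] row) {
        double[] out = new double[row.length];
        for (int j = 0; j < row.length; j++) {
            out[j] = std[j] == 0.0 ? 0.0 : (row[j] - mean[j]) / std[j];
        }
        return out;
    }
}
```

The same fitted Standardizer would be reused at prediction time, so that new samples are scaled with the training set's statistics rather than their own.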

For prediction on new samples, you need to scale each sample first before 
making predictions using your trained model. 

It's not too complicated to implement manually, but Spark API has some 
support for this already:
ML: http://spark.apache.org/docs/latest/ml-features.html#standardscaler
MLlib: 
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#standardscaler
Masood

------------------------------
Masood Krohy, Ph.D. 
Data Scientist, Intact Lab-R&D 
Intact Financial Corporation 
http://ca.linkedin.com/in/masoodkh 

From:    Carlo.Allocca <ca...@open.ac.uk>
To:      Masood Krohy <ma...@intact.net>
Cc:      Carlo.Allocca <ca...@open.ac.uk>, Mohit Jaggi 
<mo...@gmail.com>, "user@spark.apache.org" <us...@spark.apache.org>
Date:    2016-11-07 10:50
Subject: Re: LinearRegressionWithSGD and Rank Features By Importance

Hi Masood, 

Thank you very much for the reply. It is a very good point, as I am getting 
very bad results so far. 

If I understood well, what you suggest is to scale the data below (it is 
part of my dataset) before applying linear regression with SGD.

is it correct?

Many Thanks in advance. 

Best Regards,
Carlo 

On 7 Nov 2016, at 15:31, Masood Krohy <ma...@intact.net> wrote:

If you go down this route (look at actual coefficients/weights), then make 
sure your features are scaled first and have more or less the same mean 
when feeding them into the algo. If not, then actual coefficients/weights 
wouldn't tell you much. In any case, SGD performs badly with unscaled 
features, so you gain if you scale the features beforehand. 
Masood 
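
One way to see why unscaled weights say little about importance is the short derivation below. The symbols are introduced here for illustration and are not from the thread: w_j and b are the weights and intercept on the raw features x_j, and mu_j, sigma_j are the mean and std of feature j.

```latex
\begin{align*}
y &= b + \sum_j w_j x_j, \qquad z_j = \frac{x_j - \mu_j}{\sigma_j} \\
y &= b + \sum_j w_j (\sigma_j z_j + \mu_j) \\
  &= \Bigl(b + \sum_j w_j \mu_j\Bigr) + \sum_j (w_j \sigma_j)\, z_j
\end{align*}
```

So the weight on the standardized feature is w_j sigma_j. Changing the units of a feature changes its raw weight by the inverse factor without changing the fit, which is why raw weights are not comparable across features that have different scales.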

------------------------------
Masood Krohy, Ph.D. 
Data Scientist, Intact Lab-R&D 
Intact Financial Corporation 
http://ca.linkedin.com/in/masoodkh 

De :        Carlo.Allocca <ca...@open.ac.uk> 
A :        Mohit Jaggi <mo...@gmail.com> 
Cc :        Carlo.Allocca <ca...@open.ac.uk>, "
user@spark.apache.org" <us...@spark.apache.org> 
Date :        2016-11-04 03:39 
Objet :        Re: LinearRegressionWithSGD and Rank Features By Importance 

Hi Mohit, 

Thank you for your reply. 
OK. it means coefficient with high score are more important that other 
with low score…

Many Thanks,
Best Regards,
Carlo

> On 3 Nov 2016, at 20:41, Mohit Jaggi <mo...@gmail.com> wrote:
> 
> For linear regression, it should be fairly easy. Just sort the 
co-efficients :)
> 
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com
> 
> 
> 
> 
>> On Nov 3, 2016, at 3:35 AM, Carlo.Allocca <ca...@open.ac.uk> 
wrote:
>> 
>> Hi All,
>> 
>> I am using SPARK and in particular the MLib library.
>> 
>> import org.apache.spark.mllib.regression.LabeledPoint;
>> import org.apache.spark.mllib.regression.LinearRegressionModel;
>> import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
>> 
>> For my problem I am using the LinearRegressionWithSGD and I would like to perform a “Rank Features By Importance”.
>> 
>> I checked the documentation and it seems that does not provide such methods.
>> 
>> Am I missing anything?  Please, could you provide any help on this?
>> Should I change the approach?
>> 
>> Many Thanks in advance,
>> 
>> Best Regards,
>> Carlo
>> 
>> 

Re: LinearRegressionWithSGD and Rank Features By Importance

Posted by Robin East <ro...@xense.co.uk>.

If you have to use SGD, then scaling will usually help your algorithm converge more quickly. If possible, you should try using Linear Regression in the newer ml library: http://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression
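
As a rough illustration of why scaling matters for gradient-based fitting, the sketch below runs plain batch gradient descent (not Spark's LinearRegressionWithSGD) on a tiny made-up dataset, once on the raw features and once on z-scored features. The class name, the data, the step size of 0.1 and the iteration count are all assumptions chosen for the example. With one feature in the thousands, the same step size that works on the standardized data makes the raw run blow up.

```java
final class ScalingAndGradientDescent {

    // Shifts a column to zero mean and divides by its population standard deviation.
    static double[] zScore(double[] column) {
        double mean = 0.0;
        for (double v : column) mean += v;
        mean /= column.length;
        double sumSq = 0.0;
        for (double v : column) sumSq += (v - mean) * (v - mean);
        double sd = Math.sqrt(sumSq / column.length);
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) out[i] = (column[i] - mean) / sd;
        return out;
    }

    // Batch gradient descent on half the mean squared error for y ~ w1*x1 + w2*x2
    // (no intercept). Returns the mean squared error after the last iteration.
    static double finalLoss(double[] x1, double[] x2, double[] y, double step, int iterations) {
        double w1 = 0.0, w2 = 0.0;
        int n = y.length;
        for (int it = 0; it < iterations; it++) {
            double g1 = 0.0, g2 = 0.0;
            for (int i = 0; i < n; i++) {
                double err = w1 * x1[i] + w2 * x2[i] - y[i];
                g1 += err * x1[i] / n;
                g2 += err * x2[i] / n;
            }
            w1 -= step * g1;
            w2 -= step * g2;
        }
        double loss = 0.0;
        for (int i = 0; i < n; i++) {
            double err = w1 * x1[i] + w2 * x2[i] - y[i];
            loss += err * err / n;
        }
        return loss;
    }

    public static void main(String[] args) {
        // Hypothetical data: y = 2*x1 + 0.003*x2, with x2 about 1000 times larger than x1.
        double[] x1 = {1, 2, 3, 4};
        double[] x2 = {1000, 3000, 2000, 4000};
        double[] y = {5, 13, 12, 20};

        double rawLoss = finalLoss(x1, x2, y, 0.1, 1000);
        double scaledLoss = finalLoss(zScore(x1), zScore(x2), zScore(y), 0.1, 1000);

        // The raw run diverges (the loss is not a small finite number);
        // the standardized run ends with a loss very close to zero.
        System.out.println("raw features:    " + rawLoss);
        System.out.println("scaled features: " + scaledLoss);
    }
}
```

A much smaller step size would keep the raw run stable, but it would then crawl along the small-scale feature, which is the slow convergence described above.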


-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 7 Nov 2016, at 15:47, Carlo.Allocca <ca...@open.ac.uk> wrote:
> 
> Hi Masood, 
> 
> thank you very much for the reply. It is a very good point, as I am getting very bad results so far. 
> 
> If I understood correctly, what you suggest is to scale the data below (it is part of my dataset) before applying linear regression with SGD.
> 
> is it correct?
> 
> Many Thanks in advance. 
> 
> Best Regards,
> Carlo 
> 
> <Screen Shot 2016-11-07 at 15.44.51.png>
> 
>> On 7 Nov 2016, at 15:31, Masood Krohy <masood.krohy@intact.net> wrote:
>> 
>> If you go down this route (look at actual coefficients/weights), then make sure your features are scaled first and have more or less the same mean when feeding them into the algo. If not, then actual coefficients/weights wouldn't tell you much. In any case, SGD performs badly with unscaled features, so you gain if you scale the features beforehand.
>> Masood 
>> 
>> ------------------------------
>> Masood Krohy, Ph.D. 
>> Data Scientist, Intact Lab-R&D 
>> Intact Financial Corporation 
>> http://ca.linkedin.com/in/masoodkh 
>> 
>> 
>> 
>> From:    Carlo.Allocca <carlo.allocca@open.ac.uk> 
>> To:      Mohit Jaggi <mohitjaggi@gmail.com> 
>> Cc:      Carlo.Allocca <carlo.allocca@open.ac.uk>, "user@spark.apache.org" <user@spark.apache.org> 
>> Date:    2016-11-04 03:39 
>> Subject: Re: LinearRegressionWithSGD and Rank Features By Importance 
>> 
>> 
>> 
>> Hi Mohit, 
>> 
>> Thank you for your reply. 
>> OK, so it means that coefficients with a high score are more important than those with a low score…
>> 
>> Many Thanks,
>> Best Regards,
>> Carlo
>> 
>> 
>> > On 3 Nov 2016, at 20:41, Mohit Jaggi <mohitjaggi@gmail.com> wrote:
>> > 
>> > For linear regression, it should be fairly easy. Just sort the co-efficients :)
>> > 
>> > Mohit Jaggi
>> > Founder,
>> > Data Orchard LLC
>> > www.dataorchardllc.com
>> > 
>> > 
>> > 
>> > 
>> >> On Nov 3, 2016, at 3:35 AM, Carlo.Allocca <carlo.allocca@open.ac.uk> wrote:
>> >> 
>> >> Hi All,
>> >> 
>> >> I am using SPARK and in particular the MLib library.
>> >> 
>> >> import org.apache.spark.mllib.regression.LabeledPoint;
>> >> import org.apache.spark.mllib.regression.LinearRegressionModel;
>> >> import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
>> >> 
>> >> For my problem I am using the LinearRegressionWithSGD and I would like to perform a “Rank Features By Importance”.
>> >> 
>> >> I checked the documentation and it seems that does not provide such methods.
>> >> 
>> >> Am I missing anything?  Please, could you provide any help on this?
>> >> Should I change the approach?
>> >> 
>> >> Many Thanks in advance,
>> >> 
>> >> Best Regards,
>> >> Carlo
>> >> 
>> >> 
>

Re: LinearRegressionWithSGD and Rank Features By Importance

Posted by "Carlo.Allocca" <ca...@open.ac.uk>.

Hi Masood,

thank you very much for the reply. It is a very good point, as I am getting very bad results so far.

If I understood correctly, what you suggest is to scale the data below (it is part of my dataset) before applying linear regression with SGD.

is it correct?

Many Thanks in advance.

Best Regards,
Carlo

[Attached image: Screen Shot 2016-11-07 at 15.44.51.png]

On 7 Nov 2016, at 15:31, Masood Krohy <ma...@intact.net> wrote:

If you go down this route (look at actual coefficients/weights), then make sure your features are scaled first and have more or less the same mean when feeding them into the algo. If not, then actual coefficients/weights wouldn't tell you much. In any case, SGD performs badly with unscaled features, so you gain if you scale the features beforehand.

Masood

------------------------------
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R&D
Intact Financial Corporation
http://ca.linkedin.com/in/masoodkh



From:    Carlo.Allocca <ca...@open.ac.uk>
To:      Mohit Jaggi <mo...@gmail.com>
Cc:      Carlo.Allocca <ca...@open.ac.uk>, "user@spark.apache.org" <us...@spark.apache.org>
Date:    2016-11-04 03:39
Subject: Re: LinearRegressionWithSGD and Rank Features By Importance

________________________________



Hi Mohit,

Thank you for your reply.
OK, so it means that coefficients with a high score are more important than those with a low score…

Many Thanks,
Best Regards,
Carlo


> On 3 Nov 2016, at 20:41, Mohit Jaggi <mo...@gmail.com> wrote:
>
> For linear regression, it should be fairly easy. Just sort the co-efficients :)
>
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com
>
>
>
>
>> On Nov 3, 2016, at 3:35 AM, Carlo.Allocca <ca...@open.ac.uk> wrote:
>>
>> Hi All,
>>
>> I am using SPARK and in particular the MLib library.
>>
>> import org.apache.spark.mllib.regression.LabeledPoint;
>> import org.apache.spark.mllib.regression.LinearRegressionModel;
>> import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
>>
>> For my problem I am using the LinearRegressionWithSGD and I would like to perform a “Rank Features By Importance”.
>>
>> I checked the documentation and it seems that does not provide such methods.
>>
>> Am I missing anything?  Please, could you provide any help on this?
>> Should I change the approach?
>>
>> Many Thanks in advance,
>>
>> Best Regards,
>> Carlo
>>
>>

Re: LinearRegressionWithSGD and Rank Features By Importance

Posted by Masood Krohy <ma...@intact.net>.

If you go down this route (look at actual coefficients/weights), then make 
sure your features are scaled first and have more or less the same mean 
when feeding them into the algo. If not, then actual coefficients/weights 
wouldn't tell you much. In any case, SGD performs badly with unscaled 
features, so you gain if you scale the features beforehand.
Masood
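
To make the point about unscaled coefficients concrete, here is a minimal sketch in plain Java (no Spark dependency). It relies on the fact that, for an exact least-squares fit, the weight a feature would receive after z-scoring equals its raw weight multiplied by that feature's standard deviation. The class name, data and weights are hypothetical. In a real MLlib job one would more likely scale the features before training, for example with org.apache.spark.mllib.feature.StandardScaler, and an SGD fit that has not converged will not satisfy the identity exactly.

```java
import java.util.Arrays;

final class StandardizedCoefficients {

    // Population standard deviation of one feature column.
    static double stdDev(double[] column) {
        double mean = 0.0;
        for (double v : column) mean += v;
        mean /= column.length;
        double sumSq = 0.0;
        for (double v : column) sumSq += (v - mean) * (v - mean);
        return Math.sqrt(sumSq / column.length);
    }

    // Converts raw weights into the weights the same linear fit would have
    // on z-scored features: w_j * sd(x_j).
    static double[] standardize(double[] rawWeights, double[][] columns) {
        double[] out = new double[rawWeights.length];
        for (int j = 0; j < rawWeights.length; j++) {
            out[j] = rawWeights[j] * stdDev(columns[j]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical feature columns on very different scales.
        double[][] columns = {
            {1, 2, 3, 4},
            {1000, 3000, 2000, 4000}
        };
        // Hypothetical raw weights: feature 0 looks far more important here...
        double[] rawWeights = {2.0, 0.003};
        // ...but after accounting for scale, feature 1 has the larger weight.
        double[] standardized = standardize(rawWeights, columns);

        System.out.println("raw:          " + Arrays.toString(rawWeights));
        System.out.println("standardized: " + Arrays.toString(standardized));
    }
}
```

In this example the raw weights rank feature 0 first, while the standardized weights (about 2.24 and 3.35) rank feature 1 first, which is why the raw values alone would not tell you much.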

------------------------------
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R&D
Intact Financial Corporation
http://ca.linkedin.com/in/masoodkh



From:    Carlo.Allocca <ca...@open.ac.uk>
To:      Mohit Jaggi <mo...@gmail.com>
Cc:      Carlo.Allocca <ca...@open.ac.uk>, "user@spark.apache.org" <us...@spark.apache.org>
Date:    2016-11-04 03:39
Subject: Re: LinearRegressionWithSGD and Rank Features By Importance



Hi Mohit, 

Thank you for your reply. 
OK, so it means that coefficients with a high score are more important than 
those with a low score…

Many Thanks,
Best Regards,
Carlo


> On 3 Nov 2016, at 20:41, Mohit Jaggi <mo...@gmail.com> wrote:
> 
> For linear regression, it should be fairly easy. Just sort the co-efficients :)
> 
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com
> 
> 
> 
> 
>> On Nov 3, 2016, at 3:35 AM, Carlo.Allocca <ca...@open.ac.uk> wrote:
>> 
>> Hi All,
>> 
>> I am using SPARK and in particular the MLib library.
>> 
>> import org.apache.spark.mllib.regression.LabeledPoint;
>> import org.apache.spark.mllib.regression.LinearRegressionModel;
>> import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
>> 
>> For my problem I am using the LinearRegressionWithSGD and I would like to perform a “Rank Features By Importance”.
>> 
>> I checked the documentation and it seems that does not provide such methods.
>> 
>> Am I missing anything?  Please, could you provide any help on this?
>> Should I change the approach?
>> 
>> Many Thanks in advance,
>> 
>> Best Regards,
>> Carlo
>> 
>> 

Re: LinearRegressionWithSGD and Rank Features By Importance

Posted by "Carlo.Allocca" <ca...@open.ac.uk>.

Hi Mohit, 

Thank you for your reply. 
OK, so it means that coefficients with a high score are more important than those with a low score…

Many Thanks,
Best Regards,
Carlo


> On 3 Nov 2016, at 20:41, Mohit Jaggi <mo...@gmail.com> wrote:
> 
> For linear regression, it should be fairly easy. Just sort the co-efficients :)
> 
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com
> 
> 
> 
> 
>> On Nov 3, 2016, at 3:35 AM, Carlo.Allocca <ca...@open.ac.uk> wrote:
>> 
>> Hi All,
>> 
>> I am using SPARK and in particular the MLib library.
>> 
>> import org.apache.spark.mllib.regression.LabeledPoint;
>> import org.apache.spark.mllib.regression.LinearRegressionModel;
>> import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
>> 
>> For my problem I am using the LinearRegressionWithSGD and I would like to perform a “Rank Features By Importance”.
>> 
>> I checked the documentation and it seems that does not provide such methods.
>> 
>> Am I missing anything?  Please, could you provide any help on this?
>> Should I change the approach?
>> 
>> Many Thanks in advance,
>> 
>> Best Regards,
>> Carlo
>> 
>> 

Re: LinearRegressionWithSGD and Rank Features By Importance

Posted by Mohit Jaggi <mo...@gmail.com>.

For linear regression, it should be fairly easy. Just sort the co-efficients :)
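
A minimal sketch of that idea in plain Java follows. The class and method names are hypothetical, and the weights are made up; with MLlib they would presumably come from something like model.weights().toArray() on a trained LinearRegressionModel. The sort uses the absolute value, since a large negative coefficient matters as much as a large positive one, and the ranking is only meaningful if the features are on comparable scales.

```java
import java.util.Arrays;
import java.util.Comparator;

final class CoefficientRanking {

    // Returns feature indices ordered from the largest to the smallest |weight|.
    static Integer[] rankByMagnitude(double[] weights) {
        Integer[] order = new Integer[weights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Negating the magnitude turns the ascending sort into a descending one.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -Math.abs(weights[i])));
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical weights for three features.
        double[] weights = {0.4, -2.5, 1.1};
        System.out.println(Arrays.toString(rankByMagnitude(weights))); // prints [1, 2, 0]
    }
}
```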

Mohit Jaggi
Founder,
Data Orchard LLC
www.dataorchardllc.com




> On Nov 3, 2016, at 3:35 AM, Carlo.Allocca <ca...@open.ac.uk> wrote:
> 
> Hi All,
> 
> I am using SPARK and in particular the MLib library.
> 
> import org.apache.spark.mllib.regression.LabeledPoint;
> import org.apache.spark.mllib.regression.LinearRegressionModel;
> import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
> 
> For my problem I am using the LinearRegressionWithSGD and I would like to perform a “Rank Features By Importance”.
> 
> I checked the documentation and it seems that does not provide such methods.
> 
> Am I missing anything?  Please, could you provide any help on this?
> Should I change the approach?
> 
> Many Thanks in advance,
> 
> Best Regards,
> Carlo
> 
> 