
Posted to user@spark.apache.org by Sameer Tilak <ss...@live.com> on 2014/09/18 19:30:28 UTC

MLLib regression model weights

Hi All,
I am able to run LinearRegressionWithSGD on a small sample dataset (~60MB LIBSVM file of sparse data) with 6700 features.

val model = LinearRegressionWithSGD.train(examples, numIterations)

At the end I get a model with:

model.weights.size
res6: Int = 6699

I am assuming each entry in the model is the weight for the corresponding feature/index. However, if I want to get the top 10 most important features, or all features with weights higher than a certain threshold, is that functionality available out of the box? I can implement it on my own, but it seems like a common feature that most people will need when working with high-dimensional datasets.

Re: MLLib regression model weights

Posted by Xiangrui Meng <me...@gmail.com>.

The importance should be based on some statistics, for example, the
standard deviation of the feature column and the magnitude of the
weight. If the columns are scaled to unit standard deviation (using
StandardScaler), you can tell the importance by the absolute value of
the weight. But there are other statistics for feature importance. It
would be great if you are interested in working on this. -Xiangrui
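
A minimal sketch of the idea above, in plain Scala without Spark so it can be run as-is. It assumes one particular heuristic, |weight| times the column's standard deviation, which reduces to the absolute weight when the columns have unit standard deviation. The names ScaledImportance, columnStd and importance are hypothetical, and the dense Array[Array[Double]] input is only for illustration; on an RDD of MLlib vectors the per-column variances could instead be read from Statistics.colStats.

```scala
// Hypothetical helper, not part of MLlib.
object ScaledImportance {

  /** Sample standard deviation of each column of a dense, row-major dataset.
    * Assumes at least two rows and that every row has the same length. */
  def columnStd(rows: Array[Array[Double]]): Array[Double] = {
    val n = rows.length
    val d = rows.head.length
    val means = Array.tabulate(d)(j => rows.map(_(j)).sum / n)
    Array.tabulate(d) { j =>
      val sumSq = rows.map(r => math.pow(r(j) - means(j), 2)).sum
      math.sqrt(sumSq / (n - 1))
    }
  }

  /** Assumed heuristic: importance_j = |w_j| * std_j. */
  def importance(weights: Array[Double], std: Array[Double]): Array[Double] =
    weights.zip(std).map { case (w, s) => math.abs(w) * s }
}
```

With this heuristic a feature with a small weight can still rank above one with a large weight if its column varies over a much wider range, which is why the raw weights alone can be misleading on unscaled data.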

On Thu, Sep 18, 2014 at 12:17 PM, Debasish Das <de...@gmail.com> wrote:
> sc.parallelize(model.weights.toArray, blocks).top(k) will get that right ?
>
> For logistic you might want both positive and negative feature...so just
> pass it through a filter on abs and then pick top(k)


Re: MLLib regression model weights

Posted by Debasish Das <de...@gmail.com>.

sc.parallelize(model.weights.toArray, blocks).top(k) will get that, right?

For logistic regression you might want both positive and negative features, so
map the weights to their absolute values first and then pick top(k).
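
One caveat with calling top(k) on the bare weights is that the result contains only the values, not which features they belong to. Below is a rough sketch that keeps the feature index next to each weight. It uses plain Scala collections rather than an RDD, on the assumption that a few thousand weights fit comfortably on the driver; the names TopWeights, topKByAbs and aboveThreshold are hypothetical, and `weights` stands in for model.weights.toArray.

```scala
// Hypothetical helper, not part of MLlib.
object TopWeights {

  /** (featureIndex, weight) pairs for the k weights with the largest
    * absolute value, largest first. */
  def topKByAbs(weights: Array[Double], k: Int): Array[(Int, Double)] =
    weights.zipWithIndex
      .map { case (w, i) => (i, w) }
      .sortBy { case (_, w) => -math.abs(w) }
      .take(k)

  /** (featureIndex, weight) pairs whose absolute weight exceeds the threshold,
    * in original feature order. */
  def aboveThreshold(weights: Array[Double], threshold: Double): Array[(Int, Double)] =
    weights.zipWithIndex.collect {
      case (w, i) if math.abs(w) > threshold => (i, w)
    }
}
```

For example, TopWeights.topKByAbs(model.weights.toArray, 10) would give the ten largest-magnitude weights together with their indices. Ranking by magnitude is only meaningful as an importance measure if the feature columns are on comparable scales.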
