Posted to user@mahout.apache.org by Ken Krugler <kk...@transpac.com> on 2013/04/13 03:30:47 UTC

Feature reduction for LibLinear weights

Hi all,

We're (ab)using LibLinear (linear SVM) as a multi-class classifier, with 200+ labels and 400K features.

This results in a model that's > 800MB, which is a bit unwieldy. Unfortunately LibLinear uses a full array of weights (nothing sparse), being a port from the C version.

I could do feature reduction (removing rows from the matrix) with Mahout prior to training the model, but I'd prefer to reduce the (in memory) nxm array of weights.

Any suggestions for approaches to take?

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Feature reduction for LibLinear weights

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Apr 17, 2013 at 2:29 PM, Ken Krugler <kk...@transpac.com> wrote:

> Though I haven't yet found a good write-up on the value of generating more
> than one hash - seems like multiple hash values would increase the odds of
> collisions.


It does.

But it also increases the chances of handling the collision correctly.

Think about Bloom Filters.
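
(To spell out the analogy: with k hashes a feature's signal lives in k slots, so a collision in any one slot only perturbs part of it, much as a Bloom filter only reports a false positive when all k bits collide. A toy sketch of reading back a feature's effective weight under such a scheme - the mixing constants and probe count are made up for illustration:)

static double effectiveWeight(double[] w, int originalIndex, int numProbes) {
    // A feature's effective weight is the sum over its hashed slots, so a
    // collision in any single slot only corrupts 1/numProbes of the signal.
    double sum = 0.0;
    for (int probe = 0; probe < numProbes; probe++) {
        int h = originalIndex * 0x9E3779B1 + probe * 0x85EBCA77;
        h ^= (h >>> 16);
        sum += w[(h & 0x7FFFFFFF) % w.length];
    }
    return sum;
}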

Re: Feature reduction for LibLinear weights

Posted by Ken Krugler <kk...@transpac.com>.
Hi Ted,

On Apr 13, 2013, at 8:46pm, Ted Dunning wrote:

> On Sat, Apr 13, 2013 at 7:05 AM, Ken Krugler <kk...@transpac.com> wrote:
> 
>> 
>> On Apr 12, 2013, at 11:55pm, Ted Dunning wrote:
>> 
>>> The first thing to try is feature hashing to reduce your feature vector
>> size.
>> 
>> Unfortunately LibLinear takes feature indices directly (assumes they're
>> sequential ints from 0..n-1), so I don't think feature hashing will help
>> here.
>> 
> 
> I am sure that it would.  The feature indices that you give to liblinear
> don't have to be your original indices.
> 
> The simplest level of feature hashing would be to take the original feature
> indices and use multiple hashing to get 1, 2 or more new feature index
> values for each original index.  Then take these modulo the new feature
> vector size (which can be much smaller than your original).

Thanks for clarifying - I'd been thinking of the hash trick only as a way to get rid of the term-to-index map, not as a way to create a denser (smaller) matrix.

Though I haven't yet found a good write-up on the value of generating more than one hash - seems like multiple hash values would increase the odds of collisions.

For a not-so-sparse matrix and a single hash function, I got a 6% drop in accuracy. I'll have to try it with a more realistic (and sparser) data set.

-- Ken

> There will be some collisions, but the result here is a linear
> transformation of the original space and if you use multiple indexes for
> each original feature, you will lose very little, if anything.  The SVM
> will almost always be able to learn around the effects of collisions.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Feature reduction for LibLinear weights

Posted by Ted Dunning <te...@gmail.com>.
Glad to be able to help.

Double hashing would probably allow you to preserve full accuracy at higher
compression, but if you are happy, then you might as well be done.


On Wed, Apr 24, 2013 at 1:56 PM, Ken Krugler <kk...@transpac.com> wrote:

> Hi Ted,
>
> On Apr 13, 2013, at 8:46pm, Ted Dunning wrote:
>
> > On Sat, Apr 13, 2013 at 7:05 AM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
> >
> >>
> >> On Apr 12, 2013, at 11:55pm, Ted Dunning wrote:
> >>
> >>> The first thing to try is feature hashing to reduce your feature vector
> >> size.
> >>
> >> Unfortunately LibLinear takes feature indices directly (assumes they're
> >> sequential ints from 0..n-1), so I don't think feature hashing will help
> >> here.
> >>
> >
> > I am sure that it would.  The feature indices that you give to liblinear
> > don't have to be your original indices.
> >
> > The simplest level of feature hashing would be to take the original
> feature
> > indices and use multiple hashing to get 1, 2 or more new feature index
> > values for each original index.  Then take these modulo the new feature
> > vector size (which can be much smaller than your original).
>
> I finally got to run this on a full set of training data, and it worked
> really well - even with a single hash function.
>
> Without hashing, I got 81% accuracy on a held-out dataset equal to 10% of
> all documents.
>
> Hashing to 20% of the original size gave me 80% accuracy.
>
> Hashing to 10% gave me 79.6% accuracy - so essentially no change.
>
> Which means my 850MB model is now 81MB.
>
> Thanks for the help!
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>
>

Re: Feature reduction for LibLinear weights

Posted by Ken Krugler <kk...@transpac.com>.
Hi Ted,

On Apr 13, 2013, at 8:46pm, Ted Dunning wrote:

> On Sat, Apr 13, 2013 at 7:05 AM, Ken Krugler <kk...@transpac.com> wrote:
> 
>> 
>> On Apr 12, 2013, at 11:55pm, Ted Dunning wrote:
>> 
>>> The first thing to try is feature hashing to reduce your feature vector
>> size.
>> 
>> Unfortunately LibLinear takes feature indices directly (assumes they're
>> sequential ints from 0..n-1), so I don't think feature hashing will help
>> here.
>> 
> 
> I am sure that it would.  The feature indices that you give to liblinear
> don't have to be your original indices.
> 
> The simplest level of feature hashing would be to take the original feature
> indices and use multiple hashing to get 1, 2 or more new feature index
> values for each original index.  Then take these modulo the new feature
> vector size (which can be much smaller than your original).

I finally got to run this on a full set of training data, and it worked really well - even with a single hash function.

Without hashing, I got 81% accuracy on a held-out dataset equal to 10% of all documents.

Hashing to 20% of the original size gave me 80% accuracy.

Hashing to 10% gave me 79.6% accuracy - so essentially no change.

Which means my 850MB model is now 81MB.

Thanks for the help!

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Feature reduction for LibLinear weights

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Apr 13, 2013 at 7:05 AM, Ken Krugler <kk...@transpac.com> wrote:

>
> On Apr 12, 2013, at 11:55pm, Ted Dunning wrote:
>
> > The first thing to try is feature hashing to reduce your feature vector
> size.
>
> Unfortunately LibLinear takes feature indices directly (assumes they're
> sequential ints from 0..n-1), so I don't think feature hashing will help
> here.
>

I am sure that it would.  The feature indices that you give to liblinear
don't have to be your original indices.

The simplest level of feature hashing would be to take the original feature
indices and use multiple hashing to get 1, 2 or more new feature index
values for each original index.  Then take these modulo the new feature
vector size (which can be much smaller than your original).

There will be some collisions, but the result here is a linear
transformation of the original space and if you use multiple indexes for
each original feature, you will lose very little, if anything.  The SVM
will almost always be able to learn around the effects of collisions.
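
(A minimal sketch of what that folding could look like, in Java since that's where the LibLinear port lives. Everything here - the class and constant names, the mixing constants, the choice of 2 probes and a 40K target dimension - is made up for illustration; the resulting (index, value) pairs would still need to be wrapped in whatever feature-node type your LibLinear port expects, with indices sorted ascending and shifted into its required range.)

import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: fold original feature indices into a much smaller
// index space using multiple hash probes per feature.
public class HashedFeatureFolder {

    private static final int REDUCED_DIM = 40_000; // new feature vector size, much smaller than 400K
    private static final int NUM_PROBES = 2;       // 1, 2 or more hashes per original feature

    // Cheap integer mixing hash, seeded per probe; any decent hash function will do.
    private static int slot(int originalIndex, int probe) {
        int h = originalIndex * 0x9E3779B1 + probe * 0x85EBCA77;
        h ^= (h >>> 16);
        return (h & 0x7FFFFFFF) % REDUCED_DIM;
    }

    // Map original (index -> value) pairs into the reduced space, summing values
    // that land in the same slot. TreeMap keeps the reduced indices sorted.
    public static Map<Integer, Double> fold(Map<Integer, Double> original) {
        Map<Integer, Double> reduced = new TreeMap<>();
        for (Map.Entry<Integer, Double> e : original.entrySet()) {
            // Spread the value over the probes; adding the full value at each
            // probe instead is an equally common choice.
            double share = e.getValue() / NUM_PROBES;
            for (int probe = 0; probe < NUM_PROBES; probe++) {
                reduced.merge(slot(e.getKey(), probe), share, Double::sum);
            }
        }
        return reduced;
    }
}

The same folding has to be applied identically at training and classification time, since the hashed indices become the only feature space the model ever sees.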


> If I constructed a minimal perfect hash function then I could skip storing
> the mapping from feature to index, but that's not what's taking most of the
> memory; it's the n x m array of weights used by LibLinear.
>

Don't bother with a perfect hash.  Just use a decent hash.

That n x m array of weights will be 10x smaller if you cut n to n/10.
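
(As a rough sanity check, assuming 8-byte doubles: 400K features x roughly 250 labels x 8 bytes is on the order of 800MB, which matches the model size in this thread; folding down to n/10, about 40K hashed features, scales that to roughly 80MB.)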

 > With multiple probes and possibly with random weights you might be able
> to drop the size by 10x.
>
> More details here would be great, sometime when you're not trying to type
> on an iPhone :)
>

Re: Feature reduction for LibLinear weights

Posted by Ken Krugler <kk...@transpac.com>.
On Apr 12, 2013, at 11:55pm, Ted Dunning wrote:

> The first thing to try is feature hashing to reduce your feature vector size.  

Unfortunately LibLinear takes feature indices directly (assumes they're sequential ints from 0..n-1), so I don't think feature hashing will help here.

If I constructed a minimal perfect hash function then I could skip storing the mapping from feature to index, but that's not what's taking most of the memory; it's the n x m array of weights used by LibLinear.

> With multiple probes and possibly with random weights you might be able to drop the size by 10x. 

More details here would be great, sometime when you're not trying to type on an iPhone :)

-- Ken

PS - My initial naive idea was to remove any row where all of the weights were below a threshold that I calculated from the distribution of all weights. 
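
(For what it's worth, a minimal sketch of that naive pruning idea - the method name is made up, and "threshold" would come from the weight distribution as described:)

// Keep a feature (row of the weight matrix) only if at least one of its
// per-class weights clears the threshold; otherwise the row can be dropped.
static boolean keepFeature(double[] rowWeights, double threshold) {
    for (double w : rowWeights) {
        if (Math.abs(w) >= threshold) {
            return true;
        }
    }
    return false;
}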


> 
> Sent from my iPhone
> 
> On Apr 12, 2013, at 18:30, Ken Krugler <kk...@transpac.com> wrote:
> 
>> Hi all,
>> 
>> We're (ab)using LibLinear (linear SVM) as a multi-class classifier, with 200+ labels and 400K features.
>> 
>> This results in a model that's > 800MB, which is a bit unwieldy. Unfortunately LibLinear uses a full array of weights (nothing sparse), being a port from the C version.
>> 
>> I could do feature reduction (removing rows from the matrix) with Mahout prior to training the model, but I'd prefer to reduce the (in memory) nxm array of weights.
>> 
>> Any suggestions for approaches to take?
>> 
>> Thanks,
>> 
>> -- Ken
>> 
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>> 
>> 
>> 
>> 
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Feature reduction for LibLinear weights

Posted by Ted Dunning <te...@gmail.com>.
The first thing to try is feature hashing to reduce your feature vector size.  With multiple probes and possibly with random weights you might be able to drop the size by 10x. 
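
(One plausible reading of "multiple probes ... with random weights" is the signed hashing trick: each probe also carries a pseudo-random +/-1 sign, so features that collide in a slot tend to cancel rather than pile up. A rough sketch, with made-up names and constants:)

// Hypothetical sketch: add one hashed probe of a feature into the reduced
// vector, with a pseudo-random sign derived from the same hash.
static void addProbe(double[] reduced, int originalIndex, int probe, double value) {
    int h = originalIndex * 0x9E3779B1 + probe * 0x85EBCA77;
    h ^= (h >>> 16);
    int slot = (h & 0x7FFFFFFF) % reduced.length; // reduced-space index
    int sign = ((h >>> 1) & 1) == 0 ? 1 : -1;     // one hash bit reused as the sign
    reduced[slot] += sign * value;
}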

Sent from my iPhone

On Apr 12, 2013, at 18:30, Ken Krugler <kk...@transpac.com> wrote:

> Hi all,
> 
> We're (ab)using LibLinear (linear SVM) as a multi-class classifier, with 200+ labels and 400K features.
> 
> This results in a model that's > 800MB, which is a bit unwieldy. Unfortunately LibLinear uses a full array of weights (nothing sparse), being a port from the C version.
> 
> I could do feature reduction (removing rows from the matrix) with Mahout prior to training the model, but I'd prefer to reduce the (in memory) nxm array of weights.
> 
> Any suggestions for approaches to take?
> 
> Thanks,
> 
> -- Ken
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
> 
> 
> 
> 
>