Posted to user@mahout.apache.org by Mark <st...@gmail.com> on 2011/06/24 17:52:54 UTC

Adding dimensions to an existing TF-IDF vector

We are trying to cluster users together by the type of products they 
sell. For this we are building TF-IDF vectors on all of the products 
each user sells. From these vectors we create our clusters and our 
initial results aren't too bad. We would, of course, like to add some more 
information alongside the TF-IDF vectors, perhaps the categories that 
each user typically sells in.

Is it possible to add more dimensions to an existing TF-IDF vector?  If 
so, how can we determine an appropriate weighting for these new fields, 
so that they count neither too much nor too little?

Thanks for any input

Re: Adding dimensions to an existing TF-IDF vector

Posted by Ted Dunning <te...@gmail.com>.

Although paradoxically, those references don't seem to mention multiple
probes.

In fact, I haven't seen any references for that.  I thought it was obvious,
but apparently not.
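
For what it's worth, here is a minimal sketch of what hashing with multiple
probes can look like: each feature is hashed to several slots of a fixed-size
vector instead of one, so a collision in one slot still leaves the feature
distinguishable through its other slots. The function names, the MD5-based
hash, and the default of two probes are illustrative assumptions, not Mahout's
implementation.

```python
import hashlib


def probe_indexes(name, num_features, probes=2):
    """Return one slot per probe for a feature name (hypothetical helper)."""
    indexes = []
    for p in range(probes):
        # Prefix the probe number so each probe behaves like a different hash.
        digest = hashlib.md5(f"{p}:{name}".encode("utf-8")).digest()
        indexes.append(int.from_bytes(digest[:8], "big") % num_features)
    return indexes


def add_feature(vec, name, weight=1.0, probes=2):
    """Add a feature's weight to every slot it probes."""
    for i in probe_indexes(name, len(vec), probes):
        vec[i] += weight


vec = [0.0] * 50
add_feature(vec, "category:books", weight=1.0, probes=3)
```

With more than one probe, two features would have to collide in every slot
before they became indistinguishable, which is far less likely than a single
collision.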

On Sat, Jun 25, 2011 at 4:02 AM, Nick Pentreath <ni...@gmail.com> wrote:

> If you want some technical papers etc that cover how (and also why) it
> works, check out http://hunch.net/~jl/projects/hash_reps/index.html

Re: Adding dimensions to an existing TF-IDF vector

Posted by Nick Pentreath <ni...@gmail.com>.

If you want some technical papers etc that cover how (and also why) it
works, check out http://hunch.net/~jl/projects/hash_reps/index.html


On Sat, Jun 25, 2011 at 1:51 AM, Ted Dunning <te...@gmail.com> wrote:

> Look at the class FeatureVectorEncoder.  The test cases show most of the
> ways it is used.
>
> Also the class TrainNewsGroups in examples.
>
> See chapters 14 and 16 of Mahout in Action.  The sample server for chapter
> 16 does encoding like you need.

Re: Adding dimensions to an existing TF-IDF vector

Posted by Ted Dunning <te...@gmail.com>.

Look at the class FeatureVectorEncoder.  The test cases show most of the ways
it is used.

Also the class TrainNewsGroups in examples.

See chapters 14 and 16 of Mahout in Action.  The sample server for chapter
16 does encoding like you need.

On Fri, Jun 24, 2011 at 5:04 PM, Mark <st...@gmail.com> wrote:

> Where can I find out more about this hashed encoding you mentioned?
>

Re: Adding dimensions to an existing TF-IDF vector

Posted by Mark <st...@gmail.com>.

Thanks for the suggestions.

Where can I find out more about this hashed encoding you mentioned?

On 6/24/11 10:03 AM, Ted Dunning wrote:
> It is quite possible.
>
> If the new columns represent a relatively small contribution rather than a
> wholesale change in the statistics of the corpus (which is almost always
> true) then you can just add these columns and compute IDF weights for the
> new terms based on the updated corpus statistics.  You don't need to update
> the old IDF weights because the number of documents isn't going to change a
> lot and the old terms probably occur in the new documents at about the same
> rate anyway.
>
> Of course, you do have to go back through and add the zero columns to the old
> data.
>
> One work-around is to use really, really big vectors to start with and hope
> that nobody ever accidentally fills in one of these vectors.  This is cool
> with sparse vectors since zeros aren't stored, so all of the unused columns
> have no impact.  New vectors can have new columns, but old ones need no
> change since they effectively already have these columns.
>
> A second possible work-around is to use the hashed encoding.  This costs a
> bit more for encoding, but it gives you static vector sizes.  For some
> algorithms, this is a huge win (SGD for example where we need to allocate a
> dense matrix).

Re: Adding dimensions to an existing TF-IDF vector

Posted by Ted Dunning <te...@gmail.com>.

It is quite possible.

If the new columns represent a relatively small contribution rather than a
wholesale change in the statistics of the corpus (which is almost always
true) then you can just add these columns and compute IDF weights for the
new terms based on the updated corpus statistics.  You don't need to update
the old IDF weights because the number of documents isn't going to change a
lot and the old terms probably occur in the new documents at about the same
rate anyway.

Of course, you do have to go back through and add the zero columns to the old
data.
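
As a rough sketch of the steps above: compute IDF weights only for the new
category terms, append them as extra columns, and leave the old TF-IDF values
untouched. Users without a given category get a zero in that column. The
function names, the dense list-of-lists layout, and the smoothed IDF formula
are assumptions made for illustration; Mahout's own weighting may differ.

```python
import math
from collections import Counter


def smoothed_idf(doc_freq, num_docs):
    # One common smoothed IDF variant (an assumption, not Mahout's formula).
    return math.log((num_docs + 1) / (doc_freq + 1)) + 1.0


def append_category_columns(old_rows, user_categories):
    """Append one TF-IDF column per category to each existing vector.

    old_rows: existing TF-IDF vectors, one list of floats per user.
    user_categories: for each user, the categories they sell in
                     (repeats count as term frequency).
    """
    num_docs = len(old_rows)
    # Document frequency: in how many users' lists each category appears.
    doc_freq = Counter()
    for cats in user_categories:
        doc_freq.update(set(cats))
    columns = sorted(doc_freq)
    col_index = {c: i for i, c in enumerate(columns)}

    new_rows = []
    for row, cats in zip(old_rows, user_categories):
        extra = [0.0] * len(columns)  # zero columns for absent categories
        for cat, tf in Counter(cats).items():
            extra[col_index[cat]] = tf * smoothed_idf(doc_freq[cat], num_docs)
        new_rows.append(list(row) + extra)  # old weights are not recomputed
    return new_rows, columns


old_rows = [[0.5, 0.0], [0.0, 1.2], [0.3, 0.3]]
user_categories = [["books"], ["books", "toys"], []]
new_rows, columns = append_category_columns(old_rows, user_categories)
```

In this toy data the rarer category ("toys") ends up with a larger weight
than the common one ("books"), which is the usual IDF behavior.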

One work-around is to use really, really big vectors to start with and hope
that nobody ever accidentally fills in one of these vectors.  This is cool
with sparse vectors since zeros aren't stored, so all of the unused columns
have no impact.  New vectors can have new columns, but old ones need no
change since they effectively already have these columns.
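
To illustrate why the oversized sparse vector costs nothing, here is a small
sketch that stores a vector as a dict from column index to value. The
cardinality, the column numbers, and the helper name are made up for the
example.

```python
# Nominal vector size, far larger than the number of columns in use.
# The value is arbitrary; nothing of this size is ever allocated.
CARDINALITY = 2**31 - 1


def sparse_dot(a, b):
    """Dot product that only touches stored (non-zero) entries."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(v * b[i] for i, v in a.items() if i in b)


# A vector written before the new columns existed: it needs no change.
old_user = {3: 0.7, 42: 1.1}
# A newer vector that also fills in a column nobody used before.
new_user = {3: 0.2, 42: 0.5, 1_000_000: 0.9}

similarity = sparse_dot(old_user, new_user)
```

The old vector implicitly has a zero in column 1,000,000, so the new column
contributes nothing to the product and the old data stays valid as stored.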

A second possible work-around is to use the hashed encoding.  This costs a
bit more for encoding, but it gives you static vector sizes.  For some
algorithms, this is a huge win (SGD for example where we need to allocate a
dense matrix).
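
A minimal sketch of the hashed encoding idea, assuming a fixed vector size
and an MD5-based hash (both arbitrary choices for illustration, and not
Mahout's encoder classes): every feature, whether a word or a category, is
hashed to a slot, so new kinds of features never change the vector length.

```python
import hashlib

NUM_FEATURES = 1000  # fixed vector size, chosen arbitrarily for the sketch


def bucket(namespace, value, num_features=NUM_FEATURES):
    """Map a (namespace, value) pair to a slot index (hypothetical helper)."""
    # The namespace keeps the word "toys" apart from the category "toys".
    digest = hashlib.md5(f"{namespace}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features


def encode(words, categories, num_features=NUM_FEATURES):
    """Encode words and categories into one fixed-size vector of counts."""
    vec = [0.0] * num_features
    for w in words:
        vec[bucket("word", w, num_features)] += 1.0
    for c in categories:
        vec[bucket("category", c, num_features)] += 1.0
    return vec


v = encode(["red", "shoes"], ["footwear"])
```

The price is the extra hashing work per feature and the occasional collision
between unrelated features; in exchange, the vector length is known up front.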

On Fri, Jun 24, 2011 at 8:52 AM, Mark <st...@gmail.com> wrote:

> Is it possible to add more dimensions to an existing TF-IDF vector?  If so
> how would it be possible to determine what appropriate weighting to give to
> these new fields to make sure it's not too much/too little?
>