Posted to user@mahout.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2010/12/30 21:56:06 UTC

seq2sparse and lsi fold-in

Hi,

I would like to try LSI processing of results produced by seq2sparse.

What's more, I need to be able to fold-in a bunch of new documents
afterwards.

Is there any support for fold-in indexing in Mahout?

If not, is there a quick way for me to gain an understanding of the
seq2sparse output?
In particular, if I wanted to add fold-in indexing, I would need to be able
to produce the TF or TF-IDF vector of a new document on the fly, using the
pre-existing dictionary and word counts. What's the API for this dictionary?
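
Concretely, what I have in mind is something like the sketch below. It
assumes the dictionary and document-frequency files are plain SequenceFiles
of term -> index and index -> document frequency; the class name, paths and
the simplified TF-IDF weighting are illustrative only, not an existing
Mahout API (Mahout's own weighting is Lucene-style and may differ):

  import java.util.HashMap;
  import java.util.Iterator;
  import java.util.Map;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;

  public class FoldInVectorizerSketch {

    /** Load term -> index from the dictionary SequenceFile (assumed layout). */
    public static Map<String, Integer> readDictionary(Configuration conf, Path dictPath)
        throws Exception {
      Map<String, Integer> dict = new HashMap<String, Integer>();
      SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), dictPath, conf);
      Text term = new Text();
      IntWritable index = new IntWritable();
      while (reader.next(term, index)) {
        dict.put(term.toString(), index.get());
      }
      reader.close();
      return dict;
    }

    /** Naive TF-IDF for one new document, reusing the existing dictionary and counts. */
    public static Vector vectorize(Iterable<String> tokens, Map<String, Integer> dict,
                                   Map<Integer, Long> docFreqs, long numDocs) {
      Vector v = new RandomAccessSparseVector(dict.size());
      for (String token : tokens) {
        Integer idx = dict.get(token);
        if (idx != null) {
          v.set(idx, v.get(idx) + 1);          // raw term frequency against the old dictionary
        }
      }
      Iterator<Vector.Element> it = v.iterateNonZero();
      while (it.hasNext()) {
        Vector.Element e = it.next();
        Long df = docFreqs.get(e.index());
        if (df != null && df > 0) {
          e.set(e.get() * Math.log((double) numDocs / df));   // simplistic IDF weighting
        }
      }
      return v;
    }
  }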

Thank you.
-Dmitriy

Re: seq2sparse and lsi fold-in

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thank you, Jake.

Yes, I had figured that out, and it seems that DRM.times does just that. I
was just not sure about the production quality of that code; DRM seems to
have seen a lot of fixes and discussions lately, including around simple
multiplication.

On a side note, one actually needs to compute C x V^T x Sigma^-1. But I have
an option in the stochastic SVD command line to compute V x Sigma^0.5
instead of V, and U x Sigma^0.5 instead of U, in which case the correction
for the singular vectors indeed reduces to the simple multiplication
C x V^T, and the singular-values matrix can be ignored (especially if one
wants to measure similarities between a user and an item, not just
user-user or item-item).
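
For reference, the generic LSI algebra behind this (textbook fold-in,
nothing Mahout-specific; in the formulas below V is the textbook M x K
matrix with singular vectors as columns, whereas the row-wise V in this
thread is its transpose, hence C x V^T above):

  % rank-k truncated SVD of the corpus matrix
  C \approx U \Sigma V^{\top}

  % classic fold-in of a new-document matrix C' into the left (document) space
  \hat{U}' = C' V \Sigma^{-1}

  % with the half-power outputs U' = U \Sigma^{1/2} and V' = V \Sigma^{1/2},
  % the reconstruction is unchanged:
  U' (V')^{\top} = U \Sigma^{1/2} \Sigma^{1/2} V^{\top} = U \Sigma V^{\top} \approx C

  % so a dot product between a row of U' (a user/document) and a row of V'
  % (an item/term) approximates the corresponding entry of C, which is what
  % makes the user-item comparison mentioned above meaningful.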

-d




On Thu, Jan 6, 2011 at 1:45 PM, Jake Mannix <ja...@gmail.com> wrote:

> Dmitriy,
>
>  I'm not sure if you figured this out on your own and I didn't see the
> email,
> but if not:
>
> On Thu, Dec 30, 2010 at 3:57 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Also, if i have a bunch of new documents to fold-in, it looks like i'd
> need
> > to run a matrix multiplication job between new document vectors and V,
> both
> > matrices represented row-wise. So DistributedRowMatrix should help me,
> > shouldn't it? do i need to transpose the first matrix first?
> >
>
> If you have a dense matrix V of eigenvectors (ie, it has K (a small number
> like 100's) rows of dense vectors, each of which are cardinality M (which
> may large)), which is a DistributedRowMatrix, and you have your original
> document matrix C, which has N rows, each of which has cardinality M, then
> you actually need to take the transpose of *both* matrices, then take
> the DistributedRowMatrix.times() on these:
>
>  V_transpose = V.transpose();
>  C_transpose = C.transpose();
>  C_times_V_transpose = C_transpose.times(V_transpose);
>
> This code will yield the mathematical result of C * V^T, which is probably
> what you want.
>
> (it turns out that this set of operations could also be done in a custom
> operation
> using the row-paths of both V and C as inputs, but you'd still require two
> MapReduce shuffles to get the answer, so it's not really a savings to do
> this).
>
>  -jake
>

Re: seq2sparse and lsi fold-in

Posted by Jake Mannix <ja...@gmail.com>.
Dmitriy,

  I'm not sure if you figured this out on your own and I didn't see the
email,
but if not:

On Thu, Dec 30, 2010 at 3:57 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Also, if i have a bunch of new documents to fold-in, it looks like i'd need
> to run a matrix multiplication job between new document vectors and V, both
> matrices represented row-wise. So DistributedRowMatrix should help me,
> shouldn't it? do i need to transpose the first matrix first?
>

If you have a dense matrix V of eigenvectors (i.e., it has K rows of dense
vectors, where K is a small number, on the order of hundreds, and each row
has cardinality M, which may be large), and V is a DistributedRowMatrix, and
you have your original document matrix C, which has N rows, each of
cardinality M, then
you actually need to take the transpose of *both* matrices, then take
the DistributedRowMatrix.times() on these:

  V_transpose = V.transpose();
  C_transpose = C.transpose();
  C_times_V_transpose = C_transpose.times(V_transpose);

This code will yield the mathematical result of C * V^T, which is probably
what you want.
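
To make that concrete, here is a minimal end-to-end sketch. The paths,
sizes, and class name are placeholders, and the constructor and setConf()
calls reflect my assumptions about the current DistributedRowMatrix API
rather than tested code:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.mahout.math.hadoop.DistributedRowMatrix;

  public class FoldInProjectionSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path tmp = new Path("/tmp/foldin-work");     // scratch space for intermediate output

      int numDocs = 1000000;    // N: rows of C (placeholder)
      int numTerms = 50000;     // M: cardinality of each row (placeholder)
      int rank = 200;           // K: number of eigenvectors (placeholder)

      // Both matrices are stored row-wise as SequenceFiles of vectors on HDFS.
      DistributedRowMatrix c =
          new DistributedRowMatrix(new Path("/data/new-doc-vectors"), tmp, numDocs, numTerms);
      DistributedRowMatrix v =
          new DistributedRowMatrix(new Path("/data/eigenvectors"), tmp, rank, numTerms);
      c.setConf(conf);
      v.setConf(conf);

      // As above: transpose both, then times(); each step launches a MapReduce job.
      DistributedRowMatrix cT = c.transpose();
      DistributedRowMatrix vT = v.transpose();
      cT.setConf(conf);
      vT.setConf(conf);
      DistributedRowMatrix projection = cT.times(vT);   // mathematically C * V^T, an N x K matrix

      System.out.println(projection.numRows() + " x " + projection.numCols());
    }
  }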

(it turns out that this set of operations could also be done in a custom
operation
using the row-paths of both V and C as inputs, but you'd still require two
MapReduce shuffles to get the answer, so it's not really a savings to do
this).

  -jake

Re: seq2sparse and lsi fold-in

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Also, if I have a bunch of new documents to fold in, it looks like I'd need
to run a matrix multiplication job between the new document vectors and V,
both matrices represented row-wise. So DistributedRowMatrix should help me,
shouldn't it? Do I need to transpose the first matrix first?

Thank you once again, your help is really invaluable.

-Dmitriy

On Thu, Dec 30, 2010 at 1:38 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Thank you, Ted.
>
>
> On Thu, Dec 30, 2010 at 1:05 PM, Ted Dunning <te...@gmail.com>wrote:
>
>> The fourth choice is what I would recommend in general unless you need
>> very
>> easy reverse-engineering of your vectors.
>>
>> On Thu, Dec 30, 2010 at 1:04 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>> >
>> > There are two dictionary-like systems in Mahout.  Neither is quite
>> right.
>> >
>> > The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary.
>>  It
>> > doesn't do the frequency counting you want.
>> >
>> > The more complex one is in DictionaryVectorizer.  Unfortunately, it is a
>> > mass of static functions that depend on statically named files rather
>> than
>> > being a real API.
>> >
>> > There is a third choice as well
>> > in org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder.  It
>> does
>> > on-line IDF weighting and can be used underneath a text encoder to get
>> > on-line TF-IDF weighting of the sort you desire.  You can preset counts
>> > using the getDictionary accessor.
>> >
>> > A fourth choice is to simply use a static word encoder with hashed
>> vectors
>> > and do the IDF weighting as a vector element-wise multiplication.  That
>> way
>> > you only need to keep around a vector of weights and no dictionary.
>>  That
>> > should be much cheaper in memory.
>> >
>> >
>> > On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>> >wrote:
>> >
>> >> Hi,
>> >>
>> >> I would like to try LSI processing of results produced by seq2sparse.
>> >>
>> >> What's more, I need to be able to fold-in a bunch of new documents
>> >> afterwards.
>> >>
>> >> Is there any support for fold-in indexing in Mahout?
>> >>
>> >> if not, is there a quick way for me to gain the understanding of
>> >> seq2sparse
>> >> output?
>> >> In particular, if i wanted to add fold-in indexing, i need to be able
>> to
>> >> produce TF or TF-IDF of the new document on the fly using pre-existing
>> >> dictionary and word counts. What's the api for this dictionary?
>> >>
>> >> Thank you.
>> >> -Dmitriy
>> >>
>> >
>> >
>>
>
>

Re: seq2sparse and lsi fold-in

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thank you, Ted.

On Thu, Dec 30, 2010 at 1:05 PM, Ted Dunning <te...@gmail.com> wrote:

> The fourth choice is what I would recommend in general unless you need very
> easy reverse-engineering of your vectors.
>
> On Thu, Dec 30, 2010 at 1:04 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> >
> > There are two dictionary-like systems in Mahout.  Neither is quite right.
> >
> > The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary.
>  It
> > doesn't do the frequency counting you want.
> >
> > The more complex one is in DictionaryVectorizer.  Unfortunately, it is a
> > mass of static functions that depend on statically named files rather
> than
> > being a real API.
> >
> > There is a third choice as well
> > in org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder.  It
> does
> > on-line IDF weighting and can be used underneath a text encoder to get
> > on-line TF-IDF weighting of the sort you desire.  You can preset counts
> > using the getDictionary accessor.
> >
> > A fourth choice is to simply use a static word encoder with hashed
> vectors
> > and do the IDF weighting as a vector element-wise multiplication.  That
> way
> > you only need to keep around a vector of weights and no dictionary.  That
> > should be much cheaper in memory.
> >
> >
> > On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >wrote:
> >
> >> Hi,
> >>
> >> I would like to try LSI processing of results produced by seq2sparse.
> >>
> >> What's more, I need to be able to fold-in a bunch of new documents
> >> afterwards.
> >>
> >> Is there any support for fold-in indexing in Mahout?
> >>
> >> if not, is there a quick way for me to gain the understanding of
> >> seq2sparse
> >> output?
> >> In particular, if i wanted to add fold-in indexing, i need to be able to
> >> produce TF or TF-IDF of the new document on the fly using pre-existing
> >> dictionary and word counts. What's the api for this dictionary?
> >>
> >> Thank you.
> >> -Dmitriy
> >>
> >
> >
>

Re: seq2sparse and lsi fold-in

Posted by Ted Dunning <te...@gmail.com>.
The fourth choice is what I would recommend in general unless you need very
easy reverse-engineering of your vectors.

On Thu, Dec 30, 2010 at 1:04 PM, Ted Dunning <te...@gmail.com> wrote:

>
> There are two dictionary-like systems in Mahout.  Neither is quite right.
>
> The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary.  It
> doesn't do the frequency counting you want.
>
> The more complex one is in DictionaryVectorizer.  Unfortunately, it is a
> mass of static functions that depend on statically named files rather than
> being a real API.
>
> There is a third choice as well
> in org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder.  It does
> on-line IDF weighting and can be used underneath a text encoder to get
> on-line TF-IDF weighting of the sort you desire.  You can preset counts
> using the getDictionary accessor.
>
> A fourth choice is to simply use a static word encoder with hashed vectors
> and do the IDF weighting as a vector element-wise multiplication.  That way
> you only need to keep around a vector of weights and no dictionary.  That
> should be much cheaper in memory.
>
>
> On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
>
>> Hi,
>>
>> I would like to try LSI processing of results produced by seq2sparse.
>>
>> What's more, I need to be able to fold-in a bunch of new documents
>> afterwards.
>>
>> Is there any support for fold-in indexing in Mahout?
>>
>> if not, is there a quick way for me to gain the understanding of
>> seq2sparse
>> output?
>> In particular, if i wanted to add fold-in indexing, i need to be able to
>> produce TF or TF-IDF of the new document on the fly using pre-existing
>> dictionary and word counts. What's the api for this dictionary?
>>
>> Thank you.
>> -Dmitriy
>>
>
>

Re: seq2sparse and lsi fold-in

Posted by Ted Dunning <te...@gmail.com>.
There are two dictionary-like systems in Mahout.  Neither is quite right.

The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary.  It
doesn't do the frequency counting you want.

The more complex one is in DictionaryVectorizer.  Unfortunately, it is a
mass of static functions that depend on statically named files rather than
being a real API.

There is a third choice as well
in org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder.  It does
on-line IDF weighting and can be used underneath a text encoder to get
on-line TF-IDF weighting of the sort you desire.  You can preset counts
using the getDictionary accessor.
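
A rough sketch of how the adaptive encoder could be driven (the encoder
name, cardinality, and tokenization here are arbitrary placeholders, and I
have not verified the exact constructor):

  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder;

  public class AdaptiveEncodingSketch {
    public static void main(String[] args) {
      AdaptiveWordValueEncoder encoder = new AdaptiveWordValueEncoder("body");
      Vector v = new RandomAccessSparseVector(20000);   // hashed feature space, size arbitrary
      for (String token : "fold in a new document on the fly".split(" ")) {
        encoder.addToVector(token, v);   // the encoder updates its running word counts as it goes
      }
      // Pre-existing counts could be seeded through the getDictionary accessor mentioned above.
      System.out.println(v.getNumNondefaultElements() + " non-zero features");
    }
  }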

A fourth choice is to simply use a static word encoder with hashed vectors
and do the IDF weighting as a vector element-wise multiplication.  That way
you only need to keep around a vector of weights and no dictionary.  That
should be much cheaper in memory.
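
For the fourth option, a sketch along these lines, assuming
StaticWordValueEncoder is the static word encoder in question; building the
idfWeights vector (one weight per hashed slot, with the same cardinality and
probe settings as the encoder) is left out:

  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

  public class HashedTfIdfSketch {
    /** Hashed TF vector for one document, then element-wise IDF weighting. */
    public static Vector encode(Iterable<String> tokens, Vector idfWeights, int cardinality) {
      StaticWordValueEncoder encoder = new StaticWordValueEncoder("body");
      Vector tf = new RandomAccessSparseVector(cardinality);
      for (String token : tokens) {
        encoder.addToVector(token, tf);   // hashed counts, no dictionary kept in memory
      }
      return tf.times(idfWeights);        // element-wise multiply: the IDF weighting step
    }
  }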


On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> Hi,
>
> I would like to try LSI processing of results produced by seq2sparse.
>
> What's more, I need to be able to fold-in a bunch of new documents
> afterwards.
>
> Is there any support for fold-in indexing in Mahout?
>
> if not, is there a quick way for me to gain the understanding of seq2sparse
> output?
> In particular, if i wanted to add fold-in indexing, i need to be able to
> produce TF or TF-IDF of the new document on the fly using pre-existing
> dictionary and word counts. What's the api for this dictionary?
>
> Thank you.
> -Dmitriy
>

Re: seq2sparse and lsi fold-in

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS. I've already been reading through SparseVectorsFromSequenceFiles.java;
I'm just trying to figure out whether I can do this faster by getting advice
on more starting points to look at.

Thanks in advance.
-Dmitriy

On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> Hi,
>
> I would like to try LSI processing of results produced by seq2sparse.
>
> What's more, I need to be able to fold-in a bunch of new documents
> afterwards.
>
> Is there any support for fold-in indexing in Mahout?
>
> if not, is there a quick way for me to gain the understanding of seq2sparse
> output?
> In particular, if i wanted to add fold-in indexing, i need to be able to
> produce TF or TF-IDF of the new document on the fly using pre-existing
> dictionary and word counts. What's the api for this dictionary?
>
> Thank you.
> -Dmitriy
>