Posted to user@mahout.apache.org by Timothy Potter <th...@gmail.com> on 2011/03/14 02:47:44 UTC

Need a little help with using SVD

Looking for a little clarification with using SVD to reduce dimensions of my
vectors for clustering ...

Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors
with 20,444 dimensions. I successfully run Mahout SVD on the vectors using:

bin/mahout svd -i
/asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
    -o /asf-mail-archives/mahout-0.4/svd \
    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true

This produced 87 eigenvectors of size 20,444. I'm not clear as to why only
87, but I'm assuming that has something to do with Lanczos???

So then I proceeded to transpose the SVD output using:

bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444
--numRows 87

Next, I tried to run transpose on my original vectors using:

transpose -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors
--numCols 20444 --numRows 6076937

This failed with error:

java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast
to org.apache.hadoop.io.IntWritable
	at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:100)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)

So I think I'm missing something ... I'm basing my process on the steps
outlined in thread:
http://lucene.472066.n3.nabble.com/Using-SVD-with-Canopy-KMeans-td1407217.html,
i.e.

bin/*mahout* *svd* (original -> *svdOut*)
bin/*mahout* cleansvd ...
bin/*mahout* *transpose* *svdOut* -> *svdT*
bin/*mahout* *transpose* original -> originalT
bin/*mahout* matrixmult originalT *svdT* -> newMatrix
bin/*mahout* kmeans newMatrix

Based on Ted's last comment in that thread, it seems like I may not need to
transpose the original matrix? Just want to be sure this process is correct.

Re: Need a little help with using SVD

Posted by Lance Norskog <go...@gmail.com>.
Oops, hijacked. Starting a new thread.

On Sun, Mar 13, 2011 at 8:23 PM, Lance Norskog <go...@gmail.com> wrote:
> What down-projection techniques are available in Mahout, and what
> others would be useful? For example, I'm intrigued by the
> manifold-finders like ISOMAP.
>
> Lance
>
> On Sun, Mar 13, 2011 at 8:18 PM, Ted Dunning <te...@gmail.com> wrote:
>> For clustering purposes, you probably don't even need SVD here.  You can
>> project randomly down to 100-200 dimensions and do the clustering.  You have
>> to use a higher number of dimensions than you would with SVD, but avoiding
>> the SVD is a big win.  Depending on the density of your data, this may or
>> may not make clustering faster.  The key question is whether the total data
>> size is larger or smaller.
>>
>> Also, since your data is essentially count data, you have large amounts of
>> noise which probably make everything after about 20-30 singular vectors into
>> random noise anyway.  As such, I recommend replacing later singular vectors
>> with random numbers anyway.  These will be quasi-orthogonal and thus pretty
>> much as good as real singular vectors for reducing dimensionality, not quite
>> so good as a minimal basis.
>>
>> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <th...@gmail.com>wrote:
>>
>>> Looking for a little clarification with using SVD to reduce dimensions of
>>> my
>>> vectors for clustering ...
>>>
>>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors
>>> with 20,444 dimensions. I successfully run Mahout SVD on the vectors using:
>>>
>>> bin/mahout svd -i
>>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>>    -o /asf-mail-archives/mahout-0.4/svd \
>>>    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>>>
>>> This produced 87 eigenvectors of size 20,444. I'm not clear as to why only
>>> 87, but I'm assuming that has something to do with Lanczos???
>>>
>>> So then I proceeded to transpose the SVD output using:
>>>
>>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444
>>> --numRows 87
>>>
>>> Next, I tried to run transpose on my original vectors using:
>>>
>>> transpose -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors
>>> --numCols 20444 --numRows 6076937
>>>
>>> This failed with error:
>>>
>>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast
>>> to org.apache.hadoop.io.IntWritable
>>>        at
>>> org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:100)
>>>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
>>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
>>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>
>>> So I think I'm missing something ... I'm basing my process on the steps
>>> outlined in thread:
>>>
>>> http://lucene.472066.n3.nabble.com/Using-SVD-with-Canopy-KMeans-td1407217.html,
>>> i.e.
>>>
>>> bin/*mahout* *svd* (original -> *svdOut*)
>>> bin/*mahout* cleansvd ...
>>> bin/*mahout* *transpose* *svdOut* -> *svdT*
>>> bin/*mahout* *transpose* original -> originalT
>>> bin/*mahout* matrixmult originalT *svdT* -> newMatrix
>>> bin/*mahout* kmeans newMatrix
>>>
>>> Based on Ted's last comment in that thread, it seems like I may not need to
>>> transpose the original matrix? Just want to be sure this process is
>>> correct.
>>>
>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>



-- 
Lance Norskog
goksron@gmail.com

Re: Need a little help with using SVD

Posted by Lance Norskog <go...@gmail.com>.
What down-projection techniques are available in Mahout, and what
others would be useful? For example, I'm intrigued by the
manifold-finders like ISOMAP.

Lance

On Sun, Mar 13, 2011 at 8:18 PM, Ted Dunning <te...@gmail.com> wrote:
> For clustering purposes, you probably don't even need SVD here.  You can
> project randomly down to 100-200 dimensions and do the clustering.  You have
> to use a higher number of dimensions than you would with SVD, but avoiding
> the SVD is a big win.  Depending on the density of your data, this may or
> may not make clustering faster.  The key question is whether the total data
> size is larger or smaller.
>
> Also, since your data is essentially count data, you have large amounts of
> noise which probably make everything after about 20-30 singular vectors into
> random noise anyway.  As such, I recommend replacing later singular vectors
> with random numbers anyway.  These will be quasi-orthogonal and thus pretty
> much as good as real singular vectors for reducing dimensionality, not quite
> so good as a minimal basis.
>
> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <th...@gmail.com>wrote:
>
>> Looking for a little clarification with using SVD to reduce dimensions of
>> my
>> vectors for clustering ...
>>
>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors
>> with 20,444 dimensions. I successfully run Mahout SVD on the vectors using:
>>
>> bin/mahout svd -i
>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>    -o /asf-mail-archives/mahout-0.4/svd \
>>    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>>
>> This produced 87 eigenvectors of size 20,444. I'm not clear as to why only
>> 87, but I'm assuming that has something to do with Lanczos???
>>
>> So then I proceeded to transpose the SVD output using:
>>
>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444
>> --numRows 87
>>
>> Next, I tried to run transpose on my original vectors using:
>>
>> transpose -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors
>> --numCols 20444 --numRows 6076937
>>
>> This failed with error:
>>
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast
>> to org.apache.hadoop.io.IntWritable
>>        at
>> org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:100)
>>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> So I think I'm missing something ... I'm basing my process on the steps
>> outlined in thread:
>>
>> http://lucene.472066.n3.nabble.com/Using-SVD-with-Canopy-KMeans-td1407217.html,
>> i.e.
>>
>> bin/*mahout* *svd* (original -> *svdOut*)
>> bin/*mahout* cleansvd ...
>> bin/*mahout* *transpose* *svdOut* -> *svdT*
>> bin/*mahout* *transpose* original -> originalT
>> bin/*mahout* matrixmult originalT *svdT* -> newMatrix
>> bin/*mahout* kmeans newMatrix
>>
>> Based on Ted's last comment in that thread, it seems like I may not need to
>> transpose the original matrix? Just want to be sure this process is
>> correct.
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Need a little help with using SVD

Posted by Ted Dunning <te...@gmail.com>.
Yeah... SVD is pretty expensive compared to the lightning fast integration
of random projection into a map-phase.

On Sun, Mar 13, 2011 at 8:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> On Sun, Mar 13, 2011 at 8:18 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > For clustering purposes, you probably don't even need SVD here.  You can
> > project randomly down to 100-200 dimensions and do the clustering.  You
> have
> > to use a higher number of dimensions than you would with SVD, but
> avoiding
> > the SVD is a big win.
>
> Is it because of running time (efficiency) of SVD or there's something
> else?
>
> Thanks.
>

Re: Need a little help with using SVD

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Sun, Mar 13, 2011 at 8:18 PM, Ted Dunning <te...@gmail.com> wrote:
> For clustering purposes, you probably don't even need SVD here.  You can
> project randomly down to 100-200 dimensions and do the clustering.  You have
> to use a higher number of dimensions than you would with SVD, but avoiding
> the SVD is a big win.

Is it because of the running time (efficiency) of SVD, or is there something else?

Thanks.

Re: Need a little help with using SVD

Posted by Ted Dunning <te...@gmail.com>.
For clustering purposes, you probably don't even need SVD here.  You can
project randomly down to 100-200 dimensions and do the clustering.  You have
to use a higher number of dimensions than you would with SVD, but avoiding
the SVD is a big win.  Depending on the density of your data, this may or
may not make clustering faster.  The key question is whether the total data
size is larger or smaller.

Also, since your data is essentially count data, you have large amounts of
noise which probably make everything after about 20-30 singular vectors into
random noise anyway.  As such, I recommend replacing later singular vectors
with random numbers anyway.  These will be quasi-orthogonal and thus pretty
much as good as real singular vectors for reducing dimensionality, not quite
so good as a minimal basis.
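
A minimal sketch of the random projection Ted describes, written against the
Mahout math API (DenseVector, Vector.iterateNonZero); the class and method
names below are illustrative only, this is not an existing Mahout job. Each
term index is hashed to a reproducible row of +1/-1 signs, so no projection
matrix ever has to be materialized:

import java.util.Iterator;
import java.util.Random;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public final class RandomProjectionSketch {

  private RandomProjectionSketch() {}

  /**
   * Project a sparse tf-idf vector (dimension ~20K or ~380K) down to k
   * dimensions (e.g. 100-200) using a random +1/-1 sign matrix whose rows
   * are generated on the fly from the term index and a fixed seed.
   */
  public static Vector project(Vector tfidf, int k, long seed) {
    Vector projected = new DenseVector(k);
    Iterator<Vector.Element> it = tfidf.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      // Reproducible pseudo-random signs for this term's projection row.
      Random rand = new Random(seed + 31L * e.index());
      for (int j = 0; j < k; j++) {
        double sign = rand.nextBoolean() ? 1.0 : -1.0;
        projected.set(j, projected.get(j) + sign * e.get());
      }
    }
    return projected;
  }
}

Since the per-vector cost is just (non-zeros x k), this folds into a single
map pass over the tfidf-vectors, which is where "avoiding the SVD is a big
win" comes from.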

On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <th...@gmail.com>wrote:

> Looking for a little clarification with using SVD to reduce dimensions of
> my
> vectors for clustering ...
>
> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors
> with 20,444 dimensions. I successfully run Mahout SVD on the vectors using:
>
> bin/mahout svd -i
> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>    -o /asf-mail-archives/mahout-0.4/svd \
>    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>
> This produced 87 eigenvectors of size 20,444. I'm not clear as to why only
> 87, but I'm assuming that has something to do with Lanczos???
>
> So then I proceeded to transpose the SVD output using:
>
> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444
> --numRows 87
>
> Next, I tried to run transpose on my original vectors using:
>
> transpose -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors
> --numCols 20444 --numRows 6076937
>
> This failed with error:
>
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast
> to org.apache.hadoop.io.IntWritable
>        at
> org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:100)
>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> So I think I'm missing something ... I'm basing my process on the steps
> outlined in thread:
>
> http://lucene.472066.n3.nabble.com/Using-SVD-with-Canopy-KMeans-td1407217.html,
> i.e.
>
> bin/*mahout* *svd* (original -> *svdOut*)
> bin/*mahout* cleansvd ...
> bin/*mahout* *transpose* *svdOut* -> *svdT*
> bin/*mahout* *transpose* original -> originalT
> bin/*mahout* matrixmult originalT *svdT* -> newMatrix
> bin/*mahout* kmeans newMatrix
>
> Based on Ted's last comment in that thread, it seems like I may not need to
> transpose the original matrix? Just want to be sure this process is
> correct.
>

Re: Need a little help with using SVD

Posted by Ted Dunning <te...@gmail.com>.
Good man.

On Sat, Mar 19, 2011 at 8:22 AM, Timothy Potter <th...@gmail.com>wrote:

> I'd like to understand the hashed vector stuff better so will take on the
> task of creating a "clean command line integration from text => hashed
> vector => clusters" ... just as soon as I finish our write-up for 588 ;-)
>
> On Fri, Mar 18, 2011 at 10:50 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > We have the encoders and the resulting vectors should cluster as easily
> as
> > anything.
> >
> > What we don't have is a clean command line integration from text =>
> hashed
> > vector => clusters
> >
> >
> > On Fri, Mar 18, 2011 at 9:44 AM, Grant Ingersoll <gsingers@apache.org
> >wrote:
> >
> >> > Another option is to use hashed feature vectors.  These will retain
> >> > essentially all of the data of the larger vectors but will allow your
> >> > centroids to be more moderate in size.  This also helps in not
> requiring
> >> a
> >> > pass over your data to assign vector locations.
> >>
> >> Do we have code for using this with our existing algorithms?
> >
> >
> >
>

Re: Need a little help with using SVD

Posted by Timothy Potter <th...@gmail.com>.
I'd like to understand the hashed vector stuff better so will take on the
task of creating a "clean command line integration from text => hashed
vector => clusters" ... just as soon as I finish our write-up for 588 ;-)

On Fri, Mar 18, 2011 at 10:50 AM, Ted Dunning <te...@gmail.com> wrote:

> We have the encoders and the resulting vectors should cluster as easily as
> anything.
>
> What we don't have is a clean command line integration from text => hashed
> vector => clusters
>
>
> On Fri, Mar 18, 2011 at 9:44 AM, Grant Ingersoll <gs...@apache.org>wrote:
>
>> > Another option is to use hashed feature vectors.  These will retain
>> > essentially all of the data of the larger vectors but will allow your
>> > centroids to be more moderate in size.  This also helps in not requiring
>> a
>> > pass over your data to assign vector locations.
>>
>> Do we have code for using this with our existing algorithms?
>
>
>

Re: Need a little help with using SVD

Posted by Ted Dunning <te...@gmail.com>.
We have the encoders and the resulting vectors should cluster as easily as
anything.

What we don't have is a clean command line integration from text => hashed
vector => clusters

On Fri, Mar 18, 2011 at 9:44 AM, Grant Ingersoll <gs...@apache.org>wrote:

> > Another option is to use hashed feature vectors.  These will retain
> > essentially all of the data of the larger vectors but will allow your
> > centroids to be more moderate in size.  This also helps in not requiring
> a
> > pass over your data to assign vector locations.
>
> Do we have code for using this with our existing algorithms?

Re: Need a little help with using SVD

Posted by Grant Ingersoll <gs...@apache.org>.
On Mar 17, 2011, at 12:57 PM, Ted Dunning wrote:

> I think that 400K dimensions isn't a big problem if you have enough data so
> that you start to see good overlap.  If your bigrams are sparse enough that
> you don't get overlap, then this won't work.  SVD could be very helpful to
> smooth the data in those cases.

I had recommended that Tim try SVD, based on an earlier discussion that Jake, you, and I had.

It's interesting that some of our clustering algorithms aren't holding up well under the work that Tim and Szymon are doing on MAHOUT-588.  I suspect there are things we should be doing better in the implementation and that the fundamental approach is still good.  We need some good profiling tools, I suspect.  Tim and Szymon are still writing it up, but I would be interested in figuring out how we can fix the issues.

> 
> The biggest implementation problem will be the storage required for the
> centroid vectors if the number of clusters x vector size gets too large.
> 
> Another option is to use hashed feature vectors.  These will retain
> essentially all of the data of the larger vectors but will allow your
> centroids to be more moderate in size.  This also helps in not requiring a
> pass over your data to assign vector locations.

Do we have code for using this with our existing algorithms?  


> 
> On Thu, Mar 17, 2011 at 8:52 AM, Timothy Potter <th...@gmail.com>wrote:
> 
>> Hi Ted,
>> 
>> Regarding your comment: "For clustering purposes, you probably don't even
>> need SVD here ..."
>> 
>> I was just experimenting with vectors that have 20K dimensions, but my end
>> goal was to run the SVD on vectors with n-grams that have roughly 380K
>> dimensions. Do you still think SVD is not needed for this situation? My
>> thought was to get the n-gram vectors down to a more manageable size and
>> SVD
>> seemed like what I needed?
>> 
>> Cheers,
>> Tim
>> 
>> On Mon, Mar 14, 2011 at 3:56 PM, Timothy Potter <thelabdude@gmail.com
>>> wrote:
>> 
>>> Thanks for the clarification Jake.
>>> 
>>> The end goal is to run the SVD against my n-gram vector, which have 380K
>>> dimensions.
>>> 
>>> I'll update the wiki once I have this working.
>>> 
>>> Tim
>>> 
>>> 
>>> On Mon, Mar 14, 2011 at 1:09 PM, Jake Mannix <jake.mannix@gmail.com
>>> wrote:
>>> 
>>>> 
>>>> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <thelabdude@gmail.com
>>> wrote:
>>>> 
>>>>> Looking for a little clarification with using SVD to reduce dimensions
>> of
>>>>> my
>>>>> vectors for clustering ...
>>>>> 
>>>>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf
>>>>> vectors
>>>>> with 20,444 dimensions. I successfully run Mahout SVD on the vectors
>>>>> using:
>>>>> 
>>>>> bin/mahout svd -i
>>>>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>>>>   -o /asf-mail-archives/mahout-0.4/svd \
>>>>>   --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>>>>> 
>>>>> This produced 87 eigenvectors of size 20,444. I'm not clear as to why
>>>>> only
>>>>> 87, but I'm assuming that has something to do with Lanczos???
>>>>> 
>>>> 
>>>> Hi Timothy,
>>>> 
>>>>  The LanczosSolver looks for 100 eigenvectors, but then does some
>> cleanup
>>>> after
>>>> the fact: convergence issues and numeric overflow can cause some
>>>> eigenvectors
>>>> to show up twice - the last step in Mahout SVD is to remove these
>>>> spurious
>>>> eigenvectors (and also any which just don't appear to be "eigen" enough
>>>> (ie,
>>>> they don't satisfy the eigenvector criterion with high enough fidelity).
>>>> 
>>>>  If you really need more eigenvectors, you can try re-running with
>>>> rank=150,
>>>> and then take the top 100 out of however many you get out.
>>>> 
>>>> So then I proceeded to transpose the SVD output using:
>>>>> 
>>>>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444
>>>>> --numRows 87
>>>>> 
>>>>> Next, I tried to run transpose on my original vectors using:
>>>>> 
>>>>> transpose -i
>>>>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors
>>>>> --numCols 20444 --numRows 6076937
>>>>> 
>>>>> 
>>>> So the problem with this is that the tfidf-vectors is a
>>>> SequenceFile<Text,VectorWritable> - which is fine for input into
>>>> DistributedLanczosSolver (which just needs <Writable,VectorWritable>
>>>> pairs),
>>>> but not so fine for being really considered a "matrix" - you need to run
>>>> the
>>>> RowIdJob on these tfidf-vectors first.  This will normalize your
>>>> SequenceFile<Text,VectorWritable> into a
>>>> SequenceFile<IntWritable,VectorWritable>
>>>> and a SequenceFile<IntWritable,Text> (where the original one is the join of
>>>> these new ones, on the new int key).
>>>> 
>>>> Hope that helps.
>>>> 
>>>>  -jake
>>>> 
>>> 
>>> 
>> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Need a little help with using SVD

Posted by Ted Dunning <te...@gmail.com>.
I think that 400K dimensions isn't a big problem if you have enough data so
that you start to see good overlap.  If your bigrams are sparse enough that
you don't get overlap, then this won't work.  SVD could be very helpful to
smooth the data in those cases.

The biggest implementation problem will be the storage required for the
centroid vectors if the number of clusters x vector size gets too large.

Another option is to use hashed feature vectors.  These will retain
essentially all of the data of the larger vectors but will allow your
centroids to be more moderate in size.  This also helps in not requiring a
pass over your data to assign vector locations.
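
A rough sketch of what using those hashed ("feature hashing") encoders looks
like, assuming the encoder classes live under
org.apache.mahout.vectorizer.encoders with an addToVector(String, Vector)
method; the exact package and signatures should be checked against your
Mahout version, and the class name here is made up:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public final class HashedVectorSketch {

  private HashedVectorSketch() {}

  /** Encode a tokenized document into a fixed-width hashed vector. */
  public static Vector encode(Iterable<String> tokens, int cardinality) {
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("body");
    encoder.setProbes(2); // multiple probes reduce the impact of hash collisions
    Vector hashed = new RandomAccessSparseVector(cardinality); // e.g. 2^14 wide
    for (String token : tokens) {
      encoder.addToVector(token, hashed);
    }
    return hashed;
  }
}

The fixed cardinality is what keeps the k-means centroids "more moderate in
size", and no dictionary-building pass is needed to assign term indexes.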

On Thu, Mar 17, 2011 at 8:52 AM, Timothy Potter <th...@gmail.com>wrote:

> Hi Ted,
>
> Regarding your comment: "For clustering purposes, you probably don't even
> need SVD here ..."
>
> I was just experimenting with vectors that have 20K dimensions, but my end
> goal was to run the SVD on vectors with n-grams that have roughly 380K
> dimensions. Do you still think SVD is not needed for this situation? My
> thought was to get the n-gram vectors down to a more manageable size and
> SVD
> seemed like what I needed?
>
> Cheers,
> Tim
>
> On Mon, Mar 14, 2011 at 3:56 PM, Timothy Potter <thelabdude@gmail.com
> >wrote:
>
> > Thanks for the clarification Jake.
> >
> > The end goal is to run the SVD against my n-gram vector, which have 380K
> > dimensions.
> >
> > I'll update the wiki once I have this working.
> >
> > Tim
> >
> >
> > On Mon, Mar 14, 2011 at 1:09 PM, Jake Mannix <jake.mannix@gmail.com
> >wrote:
> >
> >>
> >> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <thelabdude@gmail.com
> >wrote:
> >>
> >>> Looking for a little clarification with using SVD to reduce dimensions
> of
> >>> my
> >>> vectors for clustering ...
> >>>
> >>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf
> >>> vectors
> >>> with 20,444 dimensions. I successfully run Mahout SVD on the vectors
> >>> using:
> >>>
> >>> bin/mahout svd -i
> >>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
> >>>    -o /asf-mail-archives/mahout-0.4/svd \
> >>>    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
> >>>
> >>> This produced 87 eigenvectors of size 20,444. I'm not clear as to why
> >>> only
> >>> 87, but I'm assuming that has something to do with Lanczos???
> >>>
> >>
> >> Hi Timothy,
> >>
> >>   The LanczosSolver looks for 100 eigenvectors, but then does some
> cleanup
> >> after
> >> the fact: convergence issues and numeric overflow can cause some
> >> eigenvectors
> >> to show up twice - the last step in Mahout SVD is to remove these
> >> spurious
> >> eigenvectors (and also any which just don't appear to be "eigen" enough
> >> (ie,
> >> they don't satisfy the eigenvector criterion with high enough fidelity).
> >>
> >>   If you really need more eigenvectors, you can try re-running with
> >> rank=150,
> >> and then take the top 100 out of however many you get out.
> >>
> >> So then I proceeded to transpose the SVD output using:
> >>>
> >>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444
> >>> --numRows 87
> >>>
> >>> Next, I tried to run transpose on my original vectors using:
> >>>
> >>> transpose -i
> >>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors
> >>> --numCols 20444 --numRows 6076937
> >>>
> >>>
> >> So the problem with this is that the tfidf-vectors is a
> >> SequenceFile<Text,VectorWritable> - which is fine for input into
> >> DistributedLanczosSolver (which just needs <Writable,VectorWritable>
> >> pairs),
> >> but not so fine for being really considered a "matrix" - you need to run
> >> the
> >> RowIdJob on these tfidf-vectors first.  This will normalize your
> >> SequenceFile<Text,VectorWritable> into a
> >> SequenceFile<IntWritable,VectorWritable>
> >> and a SequenceFile<IntWritable,Text> (where the original one is the join of
> >> these new ones, on the new int key).
> >>
> >> Hope that helps.
> >>
> >>   -jake
> >>
> >
> >
>

Re: Need a little help with using SVD

Posted by Timothy Potter <th...@gmail.com>.
Hi Ted,

Regarding your comment: "For clustering purposes, you probably don't even
need SVD here ..."

I was just experimenting with vectors that have 20K dimensions, but my end
goal was to run the SVD on vectors with n-grams that have roughly 380K
dimensions. Do you still think SVD is not needed for this situation? My
thought was to get the n-gram vectors down to a more manageable size and SVD
seemed like what I needed?

Cheers,
Tim

On Mon, Mar 14, 2011 at 3:56 PM, Timothy Potter <th...@gmail.com>wrote:

> Thanks for the clarification Jake.
>
> The end goal is to run the SVD against my n-gram vector, which have 380K
> dimensions.
>
> I'll update the wiki once I have this working.
>
> Tim
>
>
> On Mon, Mar 14, 2011 at 1:09 PM, Jake Mannix <ja...@gmail.com>wrote:
>
>>
>> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <th...@gmail.com>wrote:
>>
>>> Looking for a little clarification with using SVD to reduce dimensions of
>>> my
>>> vectors for clustering ...
>>>
>>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf
>>> vectors
>>> with 20,444 dimensions. I successfully run Mahout SVD on the vectors
>>> using:
>>>
>>> bin/mahout svd -i
>>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>>    -o /asf-mail-archives/mahout-0.4/svd \
>>>    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>>>
>>> This produced 87 eigenvectors of size 20,444. I'm not clear as to why
>>> only
>>> 87, but I'm assuming that has something to do with Lanczos???
>>>
>>
>> Hi Timothy,
>>
>>   The LanczosSolver looks for 100 eigenvectors, but then does some cleanup
>> after
>> the fact: convergence issues and numeric overflow can cause some
>> eigenvectors
>> to show up twice - the last step in Mahout SVD is to remove these
>> spurious
>> eigenvectors (and also any which just don't appear to be "eigen" enough
>> (ie,
>> they don't satisfy the eigenvector criterion with high enough fidelity).
>>
>>   If you really need more eigenvectors, you can try re-running with
>> rank=150,
>> and then take the top 100 out of however many you get out.
>>
>> So then I proceeded to transpose the SVD output using:
>>>
>>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444
>>> --numRows 87
>>>
>>> Next, I tried to run transpose on my original vectors using:
>>>
>>> transpose -i
>>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors
>>> --numCols 20444 --numRows 6076937
>>>
>>>
>> So the problem with this is that the tfidf-vectors is a
>> SequenceFile<Text,VectorWritable> - which is fine for input into
>> DistributedLanczosSolver (which just needs <Writable,VectorWritable>
>> pairs),
>> but not so fine for being really considered a "matrix" - you need to run
>> the
>> RowIdJob on these tfidf-vectors first.  This will normalize your
>> SequenceFile<Text,VectorWritable> into a
>> SequenceFile<IntWritable,VectorWritable>
>> and a SequenceFile<IntWritable,Text> (where the original one is the join of
>> these new ones, on the new int key).
>>
>> Hope that helps.
>>
>>   -jake
>>
>
>

Re: Need a little help with using SVD

Posted by Timothy Potter <th...@gmail.com>.
Thanks for the clarification Jake.

The end goal is to run the SVD against my n-gram vectors, which have 380K
dimensions.

I'll update the wiki once I have this working.

Tim

On Mon, Mar 14, 2011 at 1:09 PM, Jake Mannix <ja...@gmail.com> wrote:

>
> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <th...@gmail.com>wrote:
>
>> Looking for a little clarification with using SVD to reduce dimensions of
>> my
>> vectors for clustering ...
>>
>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors
>> with 20,444 dimensions. I successfully run Mahout SVD on the vectors
>> using:
>>
>> bin/mahout svd -i
>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>    -o /asf-mail-archives/mahout-0.4/svd \
>>    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>>
>> This produced 87 eigenvectors of size 20,444. I'm not clear as to why only
>> 87, but I'm assuming that has something to do with Lanczos???
>>
>
> Hi Timothy,
>
>   The LanczosSolver looks for 100 eigenvectors, but then does some cleanup
> after
> the fact: convergence issues and numeric overflow can cause some
> eigenvectors
> to show up twice - the last step in Mahout SVD is to remove these spurious
> eigenvectors (and also any which just don't appear to be "eigen" enough
> (ie,
> they don't satisfy the eigenvector criterion with high enough fidelity).
>
>   If you really need more eigenvectors, you can try re-running with
> rank=150,
> and then take the top 100 out of however many you get out.
>
> So then I proceeded to transpose the SVD output using:
>>
>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444
>> --numRows 87
>>
>> Next, I tried to run transpose on my original vectors using:
>>
>> transpose -i
>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors
>> --numCols 20444 --numRows 6076937
>>
>>
> So the problem with this is that the tfidf-vectors is a
> SequenceFile<Text,VectorWritable> - which is fine for input into
> DistributedLanczosSolver (which just needs <Writable,VectorWritable>
> pairs),
> but not so fine for being really considered a "matrix" - you need to run
> the
> RowIdJob on these tfidf-vectors first.  This will normalize your
> SequenceFile<Text,VectorWritable> into a
> SequenceFile<IntWritable,VectorWritable>
> and a SequenceFile<IntWritable,Text> (where the original one is the join of
> these new ones, on the new int key).
>
> Hope that helps.
>
>   -jake
>
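
For concreteness, a rough single-process illustration of the re-keying Jake
describes; the real rowid step is a Hadoop job, and this sketch (class name
and paths are only illustrative) simply reads one
SequenceFile<Text,VectorWritable> and writes the int-keyed matrix plus the
docIndex side file:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public final class RowIdSketch {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path(args[0]);                    // a single tfidf-vectors part file
    Path matrix = new Path(args[1], "matrix");      // SequenceFile<IntWritable,VectorWritable>
    Path docIndex = new Path(args[1], "docIndex");  // SequenceFile<IntWritable,Text>

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer matrixWriter =
        SequenceFile.createWriter(fs, conf, matrix, IntWritable.class, VectorWritable.class);
    SequenceFile.Writer indexWriter =
        SequenceFile.createWriter(fs, conf, docIndex, IntWritable.class, Text.class);

    Text docId = new Text();
    VectorWritable row = new VectorWritable();
    int rowNum = 0;
    while (reader.next(docId, row)) {
      IntWritable rowId = new IntWritable(rowNum++);
      matrixWriter.append(rowId, row);    // int-keyed row of the "matrix"
      indexWriter.append(rowId, docId);   // remember which document this row was
    }
    reader.close();
    matrixWriter.close();
    indexWriter.close();
  }
}

The "matrix" part of that output is what should then be fed to transpose and
matrixmult in place of the raw tfidf-vectors.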

Re: Need a little help with using SVD

Posted by Ted Dunning <te...@gmail.com>.
Well done Tim!

On Tue, Mar 29, 2011 at 5:39 PM, Timothy Potter <th...@gmail.com>wrote:

> Hi Jake,
>
> Success! I implemented a basic Element Comparator and sorted the random
> vector data before adding to the SequentialAccessSparseVector as you
> recommended and the TransposeJob ripped through my data in about 5 mins! My
> implementation is basic at this point relying on a custom
> Comparator<Element> and Arrays.sort() vs. the optimal way you suggested, but
> gets the job done and is actually pretty fast ... I'll post a patch after
> I've added some test cases for this.
>
> Thanks again for your help.
>
> Cheers,
> Tim
>
> On Tue, Mar 29, 2011 at 7:29 AM, Jake Mannix <ja...@gmail.com>wrote:
>
>> Riding in the cab to the airport, I just realized that this makes the
>> perfect interview question: the optimal way to do this constructor only
>> makes two new primitive arrays, and forces you to essentially reimplement
>> the basic in-place sort of the java library - you can't reuse the one in
>> Arrays or Collections because you do want it to act on two arrays at once.
>>
>> I always feel guilty having people reimplement sort(), but here you
>> basically have to!  In the real world, you can cut and paste, but it gives a
>> totally plausible backstory to the question in an interview.
>>
>>   -jake
>>
>> On Mar 29, 2011 6:15 AM, "Timothy Potter" <th...@gmail.com> wrote:
>>
>> Thanks for the pointer!
>>
>> I created issue #639 in JIRA to track this -- I'll refactor the code using
>> Jake's suggestions and post a patch later today.
>>
>> https://issues.apache.org/jira/browse/MAHOUT-639
>>
>> Best regards,
>> Tim
>>
>> On Tue, Mar 29, 2011 at 12:37 AM, Jake Mannix <ja...@gmail.com>
>> wrote: > > Suspicion confirm...
>>
>>
>

Re: Need a little help with using SVD

Posted by Timothy Potter <th...@gmail.com>.
Thanks for all the help! Was finally able to go through the process of
running SVD job and then preparing the output for clustering using
matrixmult. Patch has been posted for MAHOUT-639. Also, I formalized this
thread into an example in the wiki, see the "Example: SVD of ASF Mail
Archives on Amazon Elastic MapReduce" at
https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

My next step is to run the k-Means job on the new matrix and then start
working on the task of creating a "clean command line integration from text
=> hashed vector => clusters" ...

Tim

On Wed, Mar 30, 2011 at 3:39 PM, Lance Norskog <go...@gmail.com> wrote:

> Also, many Vector implementations have their own Element class. Will
> each need a custom comparator?
>
> On Tue, Mar 29, 2011 at 7:53 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> > Hmmm... maybe I'm being paranoid, but iirc, a custom Comparator on
> > Vector.Element instances will fail to produce correct results because
> > Vector.iterate() reuses Element object instances.  It'll be fast, yes,
> but
> > does your code pass the current unit tests?
> >
> > I ask because I think I've tried this before... :)
> >
> > On Mar 29, 2011 5:39 PM, "Timothy Potter" <th...@gmail.com> wrote:
> >
> > Hi Jake,
> >
> > Success! I implemented a basic Element Comparator and sorted the random
> > vector data before adding to the SequentialAccessSparseVector as you
> > recommended and the TransposeJob ripped through my data in about 5 mins!
> My
> > implementation is basic at this point relying on a custom
> > Comparator<Element> and Arrays.sort() vs. the optimal way you suggested,
> but
> > gets the job done and is actually pretty fast ... I'll post a patch after
> > I've added some test cases for this.
> >
> > Thanks again for your help.
> >
> > Cheers,
> > Tim
> >
> > On Tue, Mar 29, 2011 at 7:29 AM, Jake Mannix <ja...@gmail.com>
> wrote:
> >> > Riding in the cab ...
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Need a little help with using SVD

Posted by Lance Norskog <go...@gmail.com>.
Also, many Vector implementations have their own Element class. Will
each need a custom comparator?

On Tue, Mar 29, 2011 at 7:53 PM, Jake Mannix <ja...@gmail.com> wrote:
> Hmmm... maybe I'm being paranoid, but iirc, a custom Comparator on
> Vector.Element instances will fail to produce correct results because
> Vector.iterate() reuses Element object instances.  It'll be fast, yes, but
> does your code pass the current unit tests?
>
> I ask because I think I've tried this before... :)
>
> On Mar 29, 2011 5:39 PM, "Timothy Potter" <th...@gmail.com> wrote:
>
> Hi Jake,
>
> Success! I implemented a basic Element Comparator and sorted the random
> vector data before adding to the SequentialAccessSparseVector as you
> recommended and the TransposeJob ripped through my data in about 5 mins! My
> implementation is basic at this point relying on a custom
> Comparator<Element> and Arrays.sort() vs. the optimal way you suggested, but
> gets the job done and is actually pretty fast ... I'll post a patch after
> I've added some test cases for this.
>
> Thanks again for your help.
>
> Cheers,
> Tim
>
> On Tue, Mar 29, 2011 at 7:29 AM, Jake Mannix <ja...@gmail.com> wrote:
>> > Riding in the cab ...
>



-- 
Lance Norskog
goksron@gmail.com

Re: Need a little help with using SVD

Posted by Jake Mannix <ja...@gmail.com>.
Hmmm... maybe I'm being paranoid, but iirc, a custom Comparator on
Vector.Element instances will fail to produce correct results because
Vector.iterate() reuses Element object instances.  It'll be fast, yes, but
does your code pass the current unit tests?

I ask because I think I've tried this before... :)

On Mar 29, 2011 5:39 PM, "Timothy Potter" <th...@gmail.com> wrote:

Hi Jake,

Success! I implemented a basic Element Comparator and sorted the random
vector data before adding to the SequentialAccessSparseVector as you
recommended and the TransposeJob ripped through my data in about 5 mins! My
implementation is basic at this point relying on a custom
Comparator<Element> and Arrays.sort() vs. the optimal way you suggested, but
gets the job done and is actually pretty fast ... I'll post a patch after
I've added some test cases for this.

Thanks again for your help.

Cheers,
Tim

On Tue, Mar 29, 2011 at 7:29 AM, Jake Mannix <ja...@gmail.com> wrote:
> > Riding in the cab ...

Re: Need a little help with using SVD

Posted by Timothy Potter <th...@gmail.com>.
Hi Jake,

Success! I implemented a basic Element Comparator and sorted the random
vector data before adding to the SequentialAccessSparseVector as you
recommended and the TransposeJob ripped through my data in about 5 mins! My
implementation is basic at this point relying on a custom
Comparator<Element> and Arrays.sort() vs. the optimal way you suggested, but
gets the job done and is actually pretty fast ... I'll post a patch after
I've added some test cases for this.

Thanks again for your help.

Cheers,
Tim

On Tue, Mar 29, 2011 at 7:29 AM, Jake Mannix <ja...@gmail.com> wrote:

> Riding in the cab to the airport, I just realized that this makes the
> perfect interview question: the optimal way to do this constructor only
> makes two new primitive arrays, and forces you to essentially reimplement
> the basic in-place sort of the java library - you can't reuse the one in
> Arrays or Collections because you do want it to act on two arrays at once.
>
> I always feel guilty having people reimplement sort(), but here you
> basically have to!  In the real world, you can cut and paste, but it gives a
> totally plausible backstory to the question in an interview.
>
>   -jake
>
> On Mar 29, 2011 6:15 AM, "Timothy Potter" <th...@gmail.com> wrote:
>
> Thanks for the pointer!
>
> I created issue #639 in JIRA to track this -- I'll refactor the code using
> Jake's suggestions and post a patch later today.
>
> https://issues.apache.org/jira/browse/MAHOUT-639
>
> Best regards,
> Tim
>
> On Tue, Mar 29, 2011 at 12:37 AM, Jake Mannix <ja...@gmail.com>
> wrote: > > Suspicion confirm...
>
>

Re: Need a little help with using SVD

Posted by Jake Mannix <ja...@gmail.com>.
Riding in the cab to the airport, I just realized that this makes the
perfect interview question: the optimal way to do this constructor only
makes two new primitive arrays, and forces you to essentially reimplement
the basic in-place sort of the java library - you can't reuse the one in
Arrays or Collections because you do want it to act on two arrays at once.

I always feel guilty having people reimplement sort(), but here you
basically have to!  In the real world, you can cut and paste, but it gives a
totally plausible backstory to the question in an interview.

  -jake

On Mar 29, 2011 6:15 AM, "Timothy Potter" <th...@gmail.com> wrote:

Thanks for the pointer!

I created issue #639 in JIRA to track this -- I'll refactor the code using
Jake's suggestions and post a patch later today.

https://issues.apache.org/jira/browse/MAHOUT-639

Best regards,
Tim

On Tue, Mar 29, 2011 at 12:37 AM, Jake Mannix <ja...@gmail.com> wrote:
> > Suspicion confirm...

Re: Need a little help with using SVD

Posted by Timothy Potter <th...@gmail.com>.
Thanks for the pointer!

I created issue #639 in JIRA to track this -- I'll refactor the code using
Jake's suggestions and post a patch later today.

https://issues.apache.org/jira/browse/MAHOUT-639

Best regards,
Tim

On Tue, Mar 29, 2011 at 12:37 AM, Jake Mannix <ja...@gmail.com> wrote:

> Suspicion confirmed:
>
>   public SequentialAccessSparseVector(Vector other) {
>     this(other.size(), other.getNumNondefaultElements());
>     Iterator<Element> it = other.iterateNonZero();
>     Element e;
>     while (it.hasNext() && (e = it.next()) != null) {
>       set(e.index(), e.get());
>     }
>   }
>
> we iterate over the other vector (which is in random/hashed order), adding
> it to the sequential access vector (which always tries to stay in sequential
> order).  So actually, this may be *worse* than O(n^2), but I'd prefer to
> just not know how much worse, and instead we should fix it.
>
> Should be fairly straightforward: make an array of structs (essentially)
> with the index and the double, of size other.getNumNonDefaultElements()
> (what a horrible method name), fill it up on one iteration over the other
> vector, sort it in place, then make your new OrderedIntDoubleMapping out of
> the indexes and values (unless someone has a cleverer idea to sort a pair of
> two arrays at the same time, shuffling one based on the ordering criterion
> of the other).
>
>   -jake
>
> On Tue, Mar 29, 2011 at 5:38 AM, Ted Dunning <te...@gmail.com>wrote:
>
>> This sort could probably be done with a radix sort to make it truly linear as
>> well.
>>
>> I suspect previously unsuspected n^2 behavior.  That behavior now
>> officially loses its unsuspected nature (whether it exists or not).
>>
>>
>> On Mon, Mar 28, 2011 at 10:27 PM, Jake Mannix <ja...@gmail.com>wrote:
>>
>>> Hmm... a sort should be fine (N log N is the new linear!), what I'm
>>> wondering
>>> if it's not something more insidious: like incrementally building up the
>>> SequentialAccessSparseVector one element at a time (or some such N^2
>>> operation - N^2 is not the new anything, it's still N^2) which isn't
>>> noticed
>>> when it's only called rarely (at the end of the reduce, once per vector)
>>> and
>>> N is usually 100's to 1000's, not millions.
>>>
>>> On Tue, Mar 29, 2011 at 5:14 AM, Ted Dunning <te...@gmail.com>wrote:
>>>
>>>> The issue here is almost certainly that the sort implied by this
>>>> conversion
>>>> is only reasonable with a small number of non-zero elements.  Special
>>>> casing
>>>> a relatively dense case is probably worthwhile.
>>>>
>>>>
>>>> On Mon, Mar 28, 2011 at 10:04 PM, Timothy Potter <thelabdude@gmail.com
>>>> >wrote:
>>>>
>>>> > Hi Jake,
>>>> >
>>>> > I've tracked down the bottleneck in the TransposeJob's reducer to the
>>>> line
>>>> > where Mahout creates a new SequentialAccessSparseVector from a
>>>> > RandomAccessSparseVector after the while loop completes:
>>>> >
>>>> >      SequentialAccessSparseVector outVector = new
>>>> > SequentialAccessSparseVector(tmp);
>>>> >
>>>> > For high-frequency terms (some of which occur over ~1M times in my
>>>> data),
>>>> > the code to create a SequentialAccessSparseVector from a
>>>> > RandomAccessSparseVector bogs down completely. Of course, I should
>>>> probably
>>>> > get rid of the terms that occur this frequently and it seems I could
>>>> > continue to increase mapred.task.timeout but as you said, "transpose
>>>> job is
>>>> > totally trivial" so there may be a bigger issue here that is worth
>>>> digging
>>>> > into further. I think it has something to do with
>>>> OrderedIntDoubleMapping,
>>>> > but haven't had enough time to devote to this issue to know for sure
>>>> ...
>>>> > I'll keep at it though.
>>>> >
>>>> > Cheers,
>>>> > Tim
>>>> >
>>>> > On Mon, Mar 21, 2011 at 6:34 PM, Jake Mannix <ja...@gmail.com>
>>>> > wrote:
>>>> >
>>>> > >
>>>> > >
>>>> > > On Sun, Mar 20, 2011 at 5:27 PM, Timothy Potter <
>>>> thelabdude@gmail.com
>>>> > >wrote:
>>>> > >
>>>> > >> Hi Jake,
>>>> > >>
>>>> > >>
>>>> > >> I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3
>>>> > >> reducers
>>>> > >> per node and mapred.child.java.opts=-Xmx4096m. The map tasks
>>>> completed
>>>> > >> within a few minutes but then all of my 24 reducers failed near the
>>>> 70%
>>>> > >> mark
>>>> > >> with error "Task attempt_201103201840_0004_r_000023_0 failed to
>>>> report
>>>> > >> status for 601 seconds. Killing!" The data node servers stayed at a
>>>> > >> healthy
>>>> > >> load avg below 4 and never paged ...
>>>> > >>
>>>> > >
>>>> > > Weird.   The transpose job is totally trivial, but it may be that
>>>> the
>>>> > > vectors you
>>>> > > are producing are abnormally large.  In fact, if this came from
>>>> text,
>>>> > this
>>>> > > is
>>>> > > probably the case: very common words (which aren't stop words) occur
>>>> > > in most of your documents, and so are very very large.  Might they
>>>> be
>>>> > > larger than your memory? 6-million entries, each one, when
>>>> represented
>>>> > > sparsely (sadly, in this case), have an int and a float, so still
>>>> less
>>>> > than
>>>> > > 100MB,
>>>> > > so that's not it.
>>>> > >
>>>> > > BTW: you need to make sure you use the same number of reducers as
>>>> when
>>>> > > you made your eigenvector matrix, as that's a requirement for a
>>>> map-side
>>>> > > join.   That should probably be documented!
>>>> > >
>>>> > >
>>>> > >> So I increased the "mapred.task.timeout" Hadoop parameter to 20
>>>> minutes
>>>> > >> instead of the default 10 minutes and it failed again. The reduce
>>>> code
>>>> > for
>>>> > >> the TransposeJob looks straight-forward, so I'm going to have to
>>>> dig in
>>>> > >> deeper to figure out what's causing the problem.
>>>> > >>
>>>> > >
>>>> > > Yeah, if you can see where it's timing out, that would be great.
>>>>  Not
>>>> > sure
>>>> > > why that task should ever take very long.
>>>> > >
>>>> > >   -jake
>>>> > >
>>>> > >
>>>> > >
>>>> > >> Cheers,
>>>> > >> Tim
>>>> > >>
>>>> > >> On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <
>>>> jake.mannix@gmail.com>
>>>> > >> wrote:
>>>> > >>
>>>> > >> > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <
>>>> > thelabdude@gmail.com
>>>> > >> >wrote:
>>>> > >> >
>>>> > >> >> Regarding Jake's comment: " ... you need to run the RowIdJob on
>>>> these
>>>> > >> >> tfidf-vectors first ..."
>>>> > >> >>
>>>> > >> >> I did this and now have an m x n matrix T (m=6076937, n=20444).
>>>> My
>>>> > SVD
>>>> > >> >> eigenvector matrix E is p x q (p=87, q=20444).
>>>> > >> >
>>>> > >> >
>>>> > >> > Ok, so to help you understand what's going on here, I'm going to
>>>> go
>>>> > into
>>>> > >> > a little of the inner details of what's going on here.
>>>> > >> >
>>>> > >> > You are right, you have a matrix T, with 6,076,937 rows, and each
>>>> row
>>>> > >> has
>>>> > >> > 20,444 columns (most of which are zero, and it's represented
>>>> sparsely,
>>>> > >> but
>>>> > >> > still, they live in a vector space of dimension 20,444).
>>>>  Similarly,
>>>> > >> you've
>>>> > >> > made an eigenvector matrix, which has 87 rows (ie 87
>>>> eigenvectors) and
>>>> > >> > each of these rows has exactly 20,444 columns (and most likely,
>>>> > they'll
>>>> > >> > all be nonzero, because eigenvectors have no reason to be
>>>> sparse).
>>>> > >> >
>>>> > >> > In particular, T and E are represented as *lists of rows*, each
>>>> row is
>>>> > a
>>>> > >> > vector of dimension 20,444.  T has six million of these rows, and
>>>> E
>>>> > has
>>>> > >> > only 87.
>>>> > >> >
>>>> > >> >
>>>> > >> >> So to multiply these two
>>>> > >> >> matrices, I need to transpose E so that the number of columns in
>>>> T
>>>> > >> equals
>>>> > >> >> the number of rows in E (i.e. E^T is q x p) the result of the
>>>> > >> matrixmult
>>>> > >> >> would give me an m x p matrix (m=6076937, p=87).
>>>> > >> >>
>>>> > >> >
>>>> > >> > You're exactly right that you want to multiply T by E^T, because
>>>> you
>>>> > >> can't
>>>> > >> > compute T * E.
>>>> > >> >
>>>> > >> > The way it turns out in practice, computing the matrix product of
>>>> two
>>>> > >> > matrices as a map-reduce job is efficiently done as a map-side
>>>> join on
>>>> > >> > two row-based matrices with the same number of rows, and the
>>>> columns
>>>> > >> > are the ones which are different.  In particular, if you take a
>>>> matrix
>>>> > X
>>>> > >> > which
>>>> > >> > is represented as a set of numRowsX rows, each of which has
>>>> numColsX,
>>>> > >> > and another matrix with numRowsY == numRowsX, each of which has
>>>> > >> > numColsY (!= numColsX), then by summing the outer-products of
>>>> each
>>>> > >> > of the numRowsX pairs of vectors, you get a matrix of with
>>>> numRowsZ ==
>>>> > >> > numColsX, and numColsZ == numColsY (if you instead take the
>>>> reverse
>>>> > >> > outer product of the vector pairs, you can end up with the
>>>> transpose
>>>> > of
>>>> > >> > this
>>>> > >> > final result, with numRowsZ == numColsY, and numColsZ ==
>>>> numColsX).
>>>> > >> >
>>>> > >> > Unfortunately, you have a pair of matrices which have different
>>>> > numbers
>>>> > >> > of rows, and the same number of columns, but you want a pair of
>>>> > matrices
>>>> > >> > with the same number of rows and (possibly) different numbers of
>>>> > >> columns.
>>>> > >> >
>>>> > >> >
>>>> > >> >> So I tried to run matrixmult with:
>>>> > >> >
>>>> > >> > matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444
>>>> > >> --numColsB
>>>> > >> >> 87 \
>>>> > >> >> --inputPathA
>>>> > >> >>
>>>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix
>>>> > \
>>>> > >> >> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244
>>>> > >> >
>>>> > >> >
>>>> > >> >> (--inputPathA points to the output of the rowid job)
>>>> > >> >>
>>>> > >> >> This results in:
>>>> > >> >> Exception in thread "main"
>>>> > org.apache.mahout.math.CardinalityException:
>>>> > >> >>
>>>> > >> >
>>>> > >> >
>>>> > >> >> In the code, I see the test that row counts must be identical
>>>> for the
>>>> > >> two
>>>> > >> >>
>>>> > >> >> input matrices. Thus, it seems like the job requires me to
>>>> transpose
>>>> > >> this
>>>> > >> >> large matrix, just to re-transpose it back to it's original form
>>>> > during
>>>> > >> >> the
>>>> > >> >> multiplication? Or have I missed something crucial again?
>>>> > >> >>
>>>> > >> >
>>>> > >> > You actually need to transpose the input matrix (T), and then
>>>> re-run
>>>> > >> with
>>>> > >> > T^t
>>>> > >> > and E^t (the latter you apparently already have created).
>>>> > >> >
>>>> > >> > We should really rename the "matrixmultiply" job to be called
>>>> > >> > "transposematrixmultiply", because that's what it really does.
>>>> > >> >
>>>> > >> >   -jake
>>>> > >> >
>>>> > >>
>>>> > >
>>>> > >
>>>> >
>>>>
>>>
>>>
>>
>

Re: Need a little help with using SVD

Posted by Jake Mannix <ja...@gmail.com>.
Suspicion confirmed:

  public SequentialAccessSparseVector(Vector other) {
    this(other.size(), other.getNumNondefaultElements());
    Iterator<Element> it = other.iterateNonZero();
    Element e;
    while (it.hasNext() && (e = it.next()) != null) {
      set(e.index(), e.get());
    }
  }

we iterate over the other vector (which is in random/hashed order), adding
it to the sequential access vector (which always tries to stay in sequential
order).  So actually, this may be *worse* than O(n^2), but I'd prefer to
just not know how much worse, and instead we should fix it.

Should be fairly straightforward: make an array of structs (essentially)
with the index and the double, of size other.getNumNonDefaultElements()
(what a horrible method name), fill it up on one iteration over the other
vector, sort it in place, then make your new OrderedIntDoubleMapping out of
the indexes and values (unless someone has a cleverer idea to sort a pair of
two arrays at the same time, shuffling one based on the ordering criterion
of the other).

  -jake
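
A minimal sketch of the fix outlined above, using nothing beyond the Mahout
math API: copy the non-zero entries out of the randomly ordered vector first
(so the reused Element instances are never held onto), sort by index, and only
then build the sequential-access vector so that every set() appends at the
end. A boxed index sort is used here for brevity; the optimal version sorts
the two primitive arrays in place as described:

import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

public final class SequentialCopySketch {

  private SequentialCopySketch() {}

  public static SequentialAccessSparseVector toSequential(Vector other) {
    final int n = other.getNumNondefaultElements();
    final int[] indexes = new int[n];
    final double[] values = new double[n];
    int i = 0;
    Iterator<Vector.Element> it = other.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next(); // safe: primitives are copied out immediately
      indexes[i] = e.index();
      values[i] = e.get();
      i++;
    }
    // Sort positions 0..n-1 by term index.
    final Integer[] order = new Integer[n];
    for (int j = 0; j < n; j++) {
      order[j] = j;
    }
    Arrays.sort(order, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) {
        return indexes[a] - indexes[b]; // indexes are non-negative, no overflow
      }
    });
    // Appending in ascending index order keeps the copy linear instead of O(n^2).
    SequentialAccessSparseVector result =
        new SequentialAccessSparseVector(other.size(), n);
    for (int j = 0; j < n; j++) {
      result.set(indexes[order[j]], values[order[j]]);
    }
    return result;
  }
}

Because only primitive copies are sorted, this also sidesteps the pitfall
raised elsewhere in the thread about a Comparator over reused Vector.Element
instances.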

On Tue, Mar 29, 2011 at 5:38 AM, Ted Dunning <te...@gmail.com> wrote:

> This sort could probably be done with a radix sort to make it truly linear as
> well.
>
> I suspect previously unsuspected n^2 behavior.  That behavior now
> officially loses its unsuspected nature (whether it exists or not).
>
>
> On Mon, Mar 28, 2011 at 10:27 PM, Jake Mannix <ja...@gmail.com>wrote:
>
>> Hmm... a sort should be fine (N log N is the new linear!), what I'm
>> wondering
>> if it's not something more insidious: like incrementally building up the
>> SequentialAccessSparseVector one element at a time (or some such N^2
>> operation - N^2 is not the new anything, it's still N^2) which isn't
>> noticed
>> when it's only called rarely (at the end of the reduce, once per vector)
>> and
>> N is usually 100's to 1000's, not millions.
>>
>> On Tue, Mar 29, 2011 at 5:14 AM, Ted Dunning <te...@gmail.com>wrote:
>>
>>> The issue here is almost certainly that the sort implied by this
>>> conversion
>>> is only reasonable with a small number of non-zero elements.  Special
>>> casing
>>> a relatively dense case is probably worthwhile.
>>>
>>>
>>> On Mon, Mar 28, 2011 at 10:04 PM, Timothy Potter <thelabdude@gmail.com
>>> >wrote:
>>>
>>> > Hi Jake,
>>> >
>>> > I've tracked down the bottleneck in the TransposeJob's reducer to the
>>> line
>>> > where Mahout creates a new SequentialAccessSparseVector from a
>>> > RandomAccessSparseVector after the while loop completes:
>>> >
>>> >      SequentialAccessSparseVector outVector = new
>>> > SequentialAccessSparseVector(tmp);
>>> >
>>> > For high-frequency terms (some of which occur over ~1M times in my
>>> data),
>>> > the code to create a SequentialAccessSparseVector from a
>>> > RandomAccessSparseVector bogs down completely. Of course, I should
>>> probably
>>> > get rid of the terms that occur this frequently and it seems I could
>>> > continue to increase mapred.task.timeout but as you said, "transpose
>>> job is
>>> > totally trivial" so there may be a bigger issue here that is worth
>>> digging
>>> > into further. I think it has something to do with
>>> OrderedIntDoubleMapping,
>>> > but haven't had enough time to devote to this issue to know for sure
>>> ...
>>> > I'll keep at it though.
>>> >
>>> > Cheers,
>>> > Tim
>>> >
>>> > On Mon, Mar 21, 2011 at 6:34 PM, Jake Mannix <ja...@gmail.com>
>>> > wrote:
>>> >
>>> > >
>>> > >
>>> > > On Sun, Mar 20, 2011 at 5:27 PM, Timothy Potter <
>>> thelabdude@gmail.com
>>> > >wrote:
>>> > >
>>> > >> Hi Jake,
>>> > >>
>>> > >>
>>> > >> I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3
>>> > >> reducers
>>> > >> per node and mapred.child.java.opts=-Xmx4096m. The map tasks
>>> completed
>>> > >> within a few minutes but then all of my 24 reducers failed near the
>>> 70%
>>> > >> mark
>>> > >> with error "Task attempt_201103201840_0004_r_000023_0 failed to
>>> report
>>> > >> status for 601 seconds. Killing!" The data node servers stayed at a
>>> > >> healthy
>>> > >> load avg below 4 and never paged ...
>>> > >>
>>> > >
>>> > > Weird.   The transpose job is totally trivial, but it may be that the
>>> > > vectors you
>>> > > are producing are abnormally large.  In fact, if this came from text,
>>> > this
>>> > > is
>>> > > probably the case: very common words (which aren't stop words) occur
>>> > > in most of your documents, and so are very very large.  Might they be
>>> > > larger than your memory? 6-million entries, each one, when
>>> represented
>>> > > sparsely (sadly, in this case), have an int and a float, so still
>>> less
>>> > than
>>> > > 100MB,
>>> > > so that's not it.
>>> > >
>>> > > BTW: you need to make sure you use the same number of reducers as
>>> when
>>> > > you made your eigenvector matrix, as that's a requirement for a
>>> map-side
>>> > > join.   That should probably be documented!
>>> > >
>>> > >
>>> > >> So I increased the "mapred.task.timeout" Hadoop parameter to 20
>>> minutes
>>> > >> instead of the default 10 minutes and it failed again. The reduce
>>> code
>>> > for
>>> > >> the TransposeJob looks straight-forward, so I'm going to have to dig
>>> in
>>> > >> deeper to figure out what's causing the problem.
>>> > >>
>>> > >
>>> > > Yeah, if you can see where it's timing out, that would be great.  Not
>>> > sure
>>> > > why that task should ever take very long.
>>> > >
>>> > >   -jake
>>> > >
>>> > >
>>> > >
>>> > >> Cheers,
>>> > >> Tim
>>> > >>
>>> > >> On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <jake.mannix@gmail.com
>>> >
>>> > >> wrote:
>>> > >>
>>> > >> > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <
>>> > thelabdude@gmail.com
>>> > >> >wrote:
>>> > >> >
>>> > >> >> Regarding Jake's comment: " ... you need to run the RowIdJob on
>>> these
>>> > >> >> tfidf-vectors first ..."
>>> > >> >>
>>> > >> >> I did this and now have an m x n matrix T (m=6076937, n=20444).
>>> My
>>> > SVD
>>> > >> >> eigenvector matrix E is p x q (p=87, q=20444).
>>> > >> >
>>> > >> >
>>> > >> > Ok, so to help you understand what's going on here, I'm going to
>>> go
>>> > into
>>> > >> > a little of the inner details of what's going on here.
>>> > >> >
>>> > >> > You are right, you have a matrix T, with 6,076,937 rows, and each
>>> row
>>> > >> has
>>> > >> > 20,444 columns (most of which are zero, and it's represented
>>> sparsely,
>>> > >> but
>>> > >> > still, they live in a vector space of dimension 20,444).
>>>  Similarly,
>>> > >> you've
>>> > >> > made an eigenvector matrix, which has 87 rows (ie 87 eigenvectors)
>>> and
>>> > >> > each of these rows has exactly 20,444 columns (and most likely,
>>> > they'll
>>> > >> > all be nonzero, because eigenvectors have no reason to be sparse).
>>> > >> >
>>> > >> > In particular, T and E are represented as *lists of rows*, each
>>> row is
>>> > a
>>> > >> > vector of dimension 20,444.  T has six million of these rows, and
>>> E
>>> > has
>>> > >> > only 87.
>>> > >> >
>>> > >> >
>>> > >> >> So to multiply these two
>>> > >> >> matrices, I need to transpose E so that the number of columns in
>>> T
>>> > >> equals
>>> > >> >> the number of rows in E (i.e. E^T is q x p) the result of the
>>> > >> matrixmult
>>> > >> >> would give me an m x p matrix (m=6076937, p=87).
>>> > >> >>
>>> > >> >
>>> > >> > You're exactly right that you want to multiply T by E^T, because
>>> you
>>> > >> can't
>>> > >> > compute T * E.
>>> > >> >
>>> > >> > The way it turns out in practice, computing the matrix product of
>>> two
>>> > >> > matrices as a map-reduce job is efficiently done as a map-side
>>> join on
>>> > >> > two row-based matrices with the same number of rows, and the
>>> columns
>>> > >> > are the ones which are different.  In particular, if you take a
>>> matrix
>>> > X
>>> > >> > which
>>> > >> > is represented as a set of numRowsX rows, each of which has
>>> numColsX,
>>> > >> > and another matrix with numRowsY == numRowsX, each of which has
>>> > >> > numColsY (!= numColsX), then by summing the outer-products of each
>>> > >> > of the numRowsX pairs of vectors, you get a matrix of with
>>> numRowsZ ==
>>> > >> > numColsX, and numColsZ == numColsY (if you instead take the
>>> reverse
>>> > >> > outer product of the vector pairs, you can end up with the
>>> transpose
>>> > of
>>> > >> > this
>>> > >> > final result, with numRowsZ == numColsY, and numColsZ ==
>>> numColsX).
>>> > >> >
>>> > >> > Unfortunately, you have a pair of matrices which have different
>>> > numbers
>>> > >> > of rows, and the same number of columns, but you want a pair of
>>> > matrices
>>> > >> > with the same number of rows and (possibly) different numbers of
>>> > >> columns.
>>> > >> >
>>> > >> >
>>> > >> >> So I tried to run matrixmult with:
>>> > >> >
>>> > >> > matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444
>>> > >> --numColsB
>>> > >> >> 87 \
>>> > >> >> --inputPathA
>>> > >> >>
>>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix
>>> > \
>>> > >> >> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244
>>> > >> >
>>> > >> >
>>> > >> >> (--inputPathA points to the output of the rowid job)
>>> > >> >>
>>> > >> >> This results in:
>>> > >> >> Exception in thread "main"
>>> > org.apache.mahout.math.CardinalityException:
>>> > >> >>
>>> > >> >
>>> > >> >
>>> > >> >> In the code, I see the test that row counts must be identical for
>>> the
>>> > >> two
>>> > >> >>
>>> > >> >> input matrices. Thus, it seems like the job requires me to
>>> transpose
>>> > >> this
>>> > >> >> large matrix, just to re-transpose it back to it's original form
>>> > during
>>> > >> >> the
>>> > >> >> multiplication? Or have I missed something crucial again?
>>> > >> >>
>>> > >> >
>>> > >> > You actually need to transpose the input matrix (T), and then
>>> re-run
>>> > >> with
>>> > >> > T^t
>>> > >> > and E^t (the latter you apparently already have created).
>>> > >> >
>>> > >> > We should really rename the "matrixmultiply" job to be called
>>> > >> > "transposematrixmultiply", because that's what it really does.
>>> > >> >
>>> > >> >   -jake
>>> > >> >
>>> > >>
>>> > >
>>> > >
>>> >
>>>
>>
>>
>

Re: Need a little help with using SVD

Posted by Ted Dunning <te...@gmail.com>.
This sort could probably be done with a radix sort to make it truly linear
as well.

I suspect previously unsuspected n^2 behavior.  That behavior now officially
loses its unsuspected nature (whether it exists or not).
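For what it's worth, a rough sketch of that idea in plain Java (illustrative
only, not Mahout code; the class and method names are made up) - an LSD radix
sort over the int indices of a sparse row, carrying the double values along:

    public class IndexRadixSort {
      // Sorts the parallel arrays (indices[i], values[i]) by ascending index in
      // O(4 * n) via four 8-bit counting-sort passes. Assumes non-negative indices.
      public static void sort(int[] indices, double[] values) {
        int n = indices.length;
        int[] idxOut = new int[n];
        double[] valOut = new double[n];
        for (int shift = 0; shift < 32; shift += 8) {
          int[] count = new int[257];
          for (int i = 0; i < n; i++) {
            count[((indices[i] >>> shift) & 0xFF) + 1]++;
          }
          for (int b = 0; b < 256; b++) {
            count[b + 1] += count[b];               // prefix sums become bucket offsets
          }
          for (int i = 0; i < n; i++) {
            int pos = count[(indices[i] >>> shift) & 0xFF]++;
            idxOut[pos] = indices[i];
            valOut[pos] = values[i];
          }
          System.arraycopy(idxOut, 0, indices, 0, n);
          System.arraycopy(valOut, 0, values, 0, n);
        }
      }
    }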

On Mon, Mar 28, 2011 at 10:27 PM, Jake Mannix <ja...@gmail.com> wrote:

> Hmm... a sort should be fine (N log N is the new linear!), what I'm
> wondering
> if it's not something more insidious: like incrementally building up the
> SequentialAccessSparseVector one element at a time (or some such N^2
> operation - N^2 is not the new anything, it's still N^2) which isn't
> noticed
> when it's only called rarely (at the end of the reduce, once per vector)
> and
> N is usually 100's to 1000's, not millions.
>
> On Tue, Mar 29, 2011 at 5:14 AM, Ted Dunning <te...@gmail.com>wrote:
>
>> The issue here is almost certainly that the sort implied by this
>> conversion
>> is only reasonable with a small number of non-zero elements.  Special
>> casing
>> a relatively dense case is probably worthwhile.
>>
>>
>> On Mon, Mar 28, 2011 at 10:04 PM, Timothy Potter <thelabdude@gmail.com
>> >wrote:
>>
>> > Hi Jake,
>> >
>> > I've tracked down the bottleneck in the TransposeJob's reducer to the
>> line
>> > where Mahout creates a new SequentialAccessSparseVector from a
>> > RandomAccessSparseVector after the while loop completes:
>> >
>> >      SequentialAccessSparseVector outVector = new
>> > SequentialAccessSparseVector(tmp);
>> >
>> > For high-frequency terms (some of which occur over ~1M times in my
>> data),
>> > the code to create a SequentialAccessSparseVector from a
>> > RandomAccessSparseVector bogs down completely. Of course, I should
>> probably
>> > get rid of the terms that occur this frequently and it seems I could
>> > continue to increase mapred.task.timeout but as you said, "transpose job
>> is
>> > totally trivial" so there may be a bigger issue here that is worth
>> digging
>> > into further. I think it has something to do with
>> OrderedIntDoubleMapping,
>> > but haven't had enough time to devote to this issue to know for sure ...
>> > I'll keep at it though.
>> >
>> > Cheers,
>> > Tim
>> >
>> > On Mon, Mar 21, 2011 at 6:34 PM, Jake Mannix <ja...@gmail.com>
>> > wrote:
>> >
>> > >
>> > >
>> > > On Sun, Mar 20, 2011 at 5:27 PM, Timothy Potter <thelabdude@gmail.com
>> > >wrote:
>> > >
>> > >> Hi Jake,
>> > >>
>> > >>
>> > >> I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3
>> > >> reducers
>> > >> per node and mapred.child.java.opts=-Xmx4096m. The map tasks
>> completed
>> > >> within a few minutes but then all of my 24 reducers failed near the
>> 70%
>> > >> mark
>> > >> with error "Task attempt_201103201840_0004_r_000023_0 failed to
>> report
>> > >> status for 601 seconds. Killing!" The data node servers stayed at a
>> > >> healthy
>> > >> load avg below 4 and never paged ...
>> > >>
>> > >
>> > > Weird.   The transpose job is totally trivial, but it may be that the
>> > > vectors you
>> > > are producing are abnormally large.  In fact, if this came from text,
>> > this
>> > > is
>> > > probably the case: very common words (which aren't stop words) occur
>> > > in most of your documents, and so are very very large.  Might they be
>> > > larger than your memory? 6-million entries, each one, when represented
>> > > sparsely (sadly, in this case), have an int and a float, so still less
>> > than
>> > > 100MB,
>> > > so that's not it.
>> > >
>> > > BTW: you need to make sure you use the same number of reducers as when
>> > > you made your eigenvector matrix, as that's a requirement for a
>> map-side
>> > > join.   That should probably be documented!
>> > >
>> > >
>> > >> So I increased the "mapred.task.timeout" Hadoop parameter to 20
>> minutes
>> > >> instead of the default 10 minutes and it failed again. The reduce
>> code
>> > for
>> > >> the TransposeJob looks straight-forward, so I'm going to have to dig
>> in
>> > >> deeper to figure out what's causing the problem.
>> > >>
>> > >
>> > > Yeah, if you can see where it's timing out, that would be great.  Not
>> > sure
>> > > why that task should ever take very long.
>> > >
>> > >   -jake
>> > >
>> > >
>> > >
>> > >> Cheers,
>> > >> Tim
>> > >>
>> > >> On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <ja...@gmail.com>
>> > >> wrote:
>> > >>
>> > >> > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <
>> > thelabdude@gmail.com
>> > >> >wrote:
>> > >> >
>> > >> >> Regarding Jake's comment: " ... you need to run the RowIdJob on
>> these
>> > >> >> tfidf-vectors first ..."
>> > >> >>
>> > >> >> I did this and now have an m x n matrix T (m=6076937, n=20444). My
>> > SVD
>> > >> >> eigenvector matrix E is p x q (p=87, q=20444).
>> > >> >
>> > >> >
>> > >> > Ok, so to help you understand what's going on here, I'm going to go
>> > into
>> > >> > a little of the inner details of what's going on here.
>> > >> >
>> > >> > You are right, you have a matrix T, with 6,076,937 rows, and each
>> row
>> > >> has
>> > >> > 20,444 columns (most of which are zero, and it's represented
>> sparsely,
>> > >> but
>> > >> > still, they live in a vector space of dimension 20,444).
>>  Similarly,
>> > >> you've
>> > >> > made an eigenvector matrix, which has 87 rows (ie 87 eigenvectors)
>> and
>> > >> > each of these rows has exactly 20,444 columns (and most likely,
>> > they'll
>> > >> > all be nonzero, because eigenvectors have no reason to be sparse).
>> > >> >
>> > >> > In particular, T and E are represented as *lists of rows*, each row
>> is
>> > a
>> > >> > vector of dimension 20,444.  T has six million of these rows, and E
>> > has
>> > >> > only 87.
>> > >> >
>> > >> >
>> > >> >> So to multiply these two
>> > >> >> matrices, I need to transpose E so that the number of columns in T
>> > >> equals
>> > >> >> the number of rows in E (i.e. E^T is q x p) the result of the
>> > >> matrixmult
>> > >> >> would give me an m x p matrix (m=6076937, p=87).
>> > >> >>
>> > >> >
>> > >> > You're exactly right that you want to multiply T by E^T, because
>> you
>> > >> can't
>> > >> > compute T * E.
>> > >> >
>> > >> > The way it turns out in practice, computing the matrix product of
>> two
>> > >> > matrices as a map-reduce job is efficiently done as a map-side join
>> on
>> > >> > two row-based matrices with the same number of rows, and the
>> columns
>> > >> > are the ones which are different.  In particular, if you take a
>> matrix
>> > X
>> > >> > which
>> > >> > is represented as a set of numRowsX rows, each of which has
>> numColsX,
>> > >> > and another matrix with numRowsY == numRowsX, each of which has
>> > >> > numColsY (!= numColsX), then by summing the outer-products of each
>> > >> > of the numRowsX pairs of vectors, you get a matrix of with numRowsZ
>> ==
>> > >> > numColsX, and numColsZ == numColsY (if you instead take the reverse
>> > >> > outer product of the vector pairs, you can end up with the
>> transpose
>> > of
>> > >> > this
>> > >> > final result, with numRowsZ == numColsY, and numColsZ == numColsX).
>> > >> >
>> > >> > Unfortunately, you have a pair of matrices which have different
>> > numbers
>> > >> > of rows, and the same number of columns, but you want a pair of
>> > matrices
>> > >> > with the same number of rows and (possibly) different numbers of
>> > >> columns.
>> > >> >
>> > >> >
>> > >> >> So I tried to run matrixmult with:
>> > >> >
>> > >> > matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444
>> > >> --numColsB
>> > >> >> 87 \
>> > >> >> --inputPathA
>> > >> >>
>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix
>> > \
>> > >> >> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244
>> > >> >
>> > >> >
>> > >> >> (--inputPathA points to the output of the rowid job)
>> > >> >>
>> > >> >> This results in:
>> > >> >> Exception in thread "main"
>> > org.apache.mahout.math.CardinalityException:
>> > >> >>
>> > >> >
>> > >> >
>> > >> >> In the code, I see the test that row counts must be identical for
>> the
>> > >> two
>> > >> >>
>> > >> >> input matrices. Thus, it seems like the job requires me to
>> transpose
>> > >> this
>> > >> >> large matrix, just to re-transpose it back to it's original form
>> > during
>> > >> >> the
>> > >> >> multiplication? Or have I missed something crucial again?
>> > >> >>
>> > >> >
>> > >> > You actually need to transpose the input matrix (T), and then
>> re-run
>> > >> with
>> > >> > T^t
>> > >> > and E^t (the latter you apparently already have created).
>> > >> >
>> > >> > We should really rename the "matrixmultiply" job to be called
>> > >> > "transposematrixmultiply", because that's what it really does.
>> > >> >
>> > >> >   -jake
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Re: Need a little help with using SVD

Posted by Jake Mannix <ja...@gmail.com>.
Hmm... a sort should be fine (N log N is the new linear!). What I'm
wondering is whether it's something more insidious: like incrementally
building up the SequentialAccessSparseVector one element at a time (or some
such N^2 operation - N^2 is not the new anything, it's still N^2), which
isn't noticed when it's only called rarely (at the end of the reduce, once
per vector) and N is usually 100's to 1000's, not millions.
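A rough way to check whether that shape is the culprit, outside Hadoop (a
sketch only - it assumes the Mahout 0.4 vector classes discussed in this
thread, and the sizes are arbitrary):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.SequentialAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class BuildPatternCheck {
      public static void main(String[] args) {
        int cardinality = 6076937;  // columns of a transposed row: one per document
        int nonZeros = 50000;       // bump this up and watch how the two patterns diverge

        // Pattern 1: accumulate in a hashed vector, convert once at the end
        // (the shape of TransposeJob's reducer, per Tim's description).
        Vector tmp = new RandomAccessSparseVector(cardinality);
        for (int i = 0; i < nonZeros; i++) {
          tmp.set(i * 100, 1.0);
        }
        long start = System.nanoTime();
        Vector converted = new SequentialAccessSparseVector(tmp);
        System.out.printf("single conversion: %.1f ms, nnz=%d%n",
            (System.nanoTime() - start) / 1e6, converted.getNumNondefaultElements());

        // Pattern 2: element-by-element inserts into the sequential-access vector.
        // Descending index order forces an array shift on every insert - the N^2 shape.
        Vector incremental = new SequentialAccessSparseVector(cardinality);
        start = System.nanoTime();
        for (int i = nonZeros - 1; i >= 0; i--) {
          incremental.set(i * 100, 1.0);
        }
        System.out.printf("incremental build: %.1f ms%n", (System.nanoTime() - start) / 1e6);
      }
    }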

On Tue, Mar 29, 2011 at 5:14 AM, Ted Dunning <te...@gmail.com> wrote:

> The issue here is almost certainly that the sort implied by this conversion
> is only reasonable with a small number of non-zero elements.  Special
> casing
> a relatively dense case is probably worthwhile.
>
>
> On Mon, Mar 28, 2011 at 10:04 PM, Timothy Potter <thelabdude@gmail.com
> >wrote:
>
> > Hi Jake,
> >
> > I've tracked down the bottleneck in the TransposeJob's reducer to the
> line
> > where Mahout creates a new SequentialAccessSparseVector from a
> > RandomAccessSparseVector after the while loop completes:
> >
> >      SequentialAccessSparseVector outVector = new
> > SequentialAccessSparseVector(tmp);
> >
> > For high-frequency terms (some of which occur over ~1M times in my data),
> > the code to create a SequentialAccessSparseVector from a
> > RandomAccessSparseVector bogs down completely. Of course, I should
> probably
> > get rid of the terms that occur this frequently and it seems I could
> > continue to increase mapred.task.timeout but as you said, "transpose job
> is
> > totally trivial" so there may be a bigger issue here that is worth
> digging
> > into further. I think it has something to do with
> OrderedIntDoubleMapping,
> > but haven't had enough time to devote to this issue to know for sure ...
> > I'll keep at it though.
> >
> > Cheers,
> > Tim
> >
> > On Mon, Mar 21, 2011 at 6:34 PM, Jake Mannix <ja...@gmail.com>
> > wrote:
> >
> > >
> > >
> > > On Sun, Mar 20, 2011 at 5:27 PM, Timothy Potter <thelabdude@gmail.com
> > >wrote:
> > >
> > >> Hi Jake,
> > >>
> > >>
> > >> I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3
> > >> reducers
> > >> per node and mapred.child.java.opts=-Xmx4096m. The map tasks completed
> > >> within a few minutes but then all of my 24 reducers failed near the
> 70%
> > >> mark
> > >> with error "Task attempt_201103201840_0004_r_000023_0 failed to report
> > >> status for 601 seconds. Killing!" The data node servers stayed at a
> > >> healthy
> > >> load avg below 4 and never paged ...
> > >>
> > >
> > > Weird.   The transpose job is totally trivial, but it may be that the
> > > vectors you
> > > are producing are abnormally large.  In fact, if this came from text,
> > this
> > > is
> > > probably the case: very common words (which aren't stop words) occur
> > > in most of your documents, and so are very very large.  Might they be
> > > larger than your memory? 6-million entries, each one, when represented
> > > sparsely (sadly, in this case), have an int and a float, so still less
> > than
> > > 100MB,
> > > so that's not it.
> > >
> > > BTW: you need to make sure you use the same number of reducers as when
> > > you made your eigenvector matrix, as that's a requirement for a
> map-side
> > > join.   That should probably be documented!
> > >
> > >
> > >> So I increased the "mapred.task.timeout" Hadoop parameter to 20
> minutes
> > >> instead of the default 10 minutes and it failed again. The reduce code
> > for
> > >> the TransposeJob looks straight-forward, so I'm going to have to dig
> in
> > >> deeper to figure out what's causing the problem.
> > >>
> > >
> > > Yeah, if you can see where it's timing out, that would be great.  Not
> > sure
> > > why that task should ever take very long.
> > >
> > >   -jake
> > >
> > >
> > >
> > >> Cheers,
> > >> Tim
> > >>
> > >> On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <ja...@gmail.com>
> > >> wrote:
> > >>
> > >> > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <
> > thelabdude@gmail.com
> > >> >wrote:
> > >> >
> > >> >> Regarding Jake's comment: " ... you need to run the RowIdJob on
> these
> > >> >> tfidf-vectors first ..."
> > >> >>
> > >> >> I did this and now have an m x n matrix T (m=6076937, n=20444). My
> > SVD
> > >> >> eigenvector matrix E is p x q (p=87, q=20444).
> > >> >
> > >> >
> > >> > Ok, so to help you understand what's going on here, I'm going to go
> > into
> > >> > a little of the inner details of what's going on here.
> > >> >
> > >> > You are right, you have a matrix T, with 6,076,937 rows, and each
> row
> > >> has
> > >> > 20,444 columns (most of which are zero, and it's represented
> sparsely,
> > >> but
> > >> > still, they live in a vector space of dimension 20,444).  Similarly,
> > >> you've
> > >> > made an eigenvector matrix, which has 87 rows (ie 87 eigenvectors)
> and
> > >> > each of these rows has exactly 20,444 columns (and most likely,
> > they'll
> > >> > all be nonzero, because eigenvectors have no reason to be sparse).
> > >> >
> > >> > In particular, T and E are represented as *lists of rows*, each row
> is
> > a
> > >> > vector of dimension 20,444.  T has six million of these rows, and E
> > has
> > >> > only 87.
> > >> >
> > >> >
> > >> >> So to multiply these two
> > >> >> matrices, I need to transpose E so that the number of columns in T
> > >> equals
> > >> >> the number of rows in E (i.e. E^T is q x p) the result of the
> > >> matrixmult
> > >> >> would give me an m x p matrix (m=6076937, p=87).
> > >> >>
> > >> >
> > >> > You're exactly right that you want to multiply T by E^T, because you
> > >> can't
> > >> > compute T * E.
> > >> >
> > >> > The way it turns out in practice, computing the matrix product of
> two
> > >> > matrices as a map-reduce job is efficiently done as a map-side join
> on
> > >> > two row-based matrices with the same number of rows, and the columns
> > >> > are the ones which are different.  In particular, if you take a
> matrix
> > X
> > >> > which
> > >> > is represented as a set of numRowsX rows, each of which has
> numColsX,
> > >> > and another matrix with numRowsY == numRowsX, each of which has
> > >> > numColsY (!= numColsX), then by summing the outer-products of each
> > >> > of the numRowsX pairs of vectors, you get a matrix of with numRowsZ
> ==
> > >> > numColsX, and numColsZ == numColsY (if you instead take the reverse
> > >> > outer product of the vector pairs, you can end up with the transpose
> > of
> > >> > this
> > >> > final result, with numRowsZ == numColsY, and numColsZ == numColsX).
> > >> >
> > >> > Unfortunately, you have a pair of matrices which have different
> > numbers
> > >> > of rows, and the same number of columns, but you want a pair of
> > matrices
> > >> > with the same number of rows and (possibly) different numbers of
> > >> columns.
> > >> >
> > >> >
> > >> >> So I tried to run matrixmult with:
> > >> >
> > >> > matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444
> > >> --numColsB
> > >> >> 87 \
> > >> >> --inputPathA
> > >> >>
> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix
> > \
> > >> >> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244
> > >> >
> > >> >
> > >> >> (--inputPathA points to the output of the rowid job)
> > >> >>
> > >> >> This results in:
> > >> >> Exception in thread "main"
> > org.apache.mahout.math.CardinalityException:
> > >> >>
> > >> >
> > >> >
> > >> >> In the code, I see the test that row counts must be identical for
> the
> > >> two
> > >> >>
> > >> >> input matrices. Thus, it seems like the job requires me to
> transpose
> > >> this
> > >> >> large matrix, just to re-transpose it back to it's original form
> > during
> > >> >> the
> > >> >> multiplication? Or have I missed something crucial again?
> > >> >>
> > >> >
> > >> > You actually need to transpose the input matrix (T), and then re-run
> > >> with
> > >> > T^t
> > >> > and E^t (the latter you apparently already have created).
> > >> >
> > >> > We should really rename the "matrixmultiply" job to be called
> > >> > "transposematrixmultiply", because that's what it really does.
> > >> >
> > >> >   -jake
> > >> >
> > >>
> > >
> > >
> >
>

Re: Need a little help with using SVD

Posted by Ted Dunning <te...@gmail.com>.
The issue here is almost certainly that the sort implied by this conversion
is only reasonable with a small number of non-zero elements.  Special casing
a relatively dense case is probably worthwhile.
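A sketch of what that special-casing could look like at the end of the
reducer (illustrative only - the 25% cutoff and the helper name are made up,
and this is not what Mahout currently does):

    import java.util.Iterator;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.SequentialAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public final class SealVector {
      private SealVector() {}

      // Turns the accumulated RandomAccessSparseVector into the output vector,
      // skipping the sort-based sparse conversion when the row is relatively dense.
      static Vector seal(RandomAccessSparseVector tmp) {
        double fillRatio = (double) tmp.getNumNondefaultElements() / tmp.size();
        if (fillRatio > 0.25) {                        // arbitrary "dense enough" cutoff
          Vector dense = new DenseVector(tmp.size());
          for (Iterator<Vector.Element> it = tmp.iterateNonZero(); it.hasNext();) {
            Vector.Element e = it.next();
            dense.setQuick(e.index(), e.get());        // straight copy, no sort
          }
          return dense;
        }
        return new SequentialAccessSparseVector(tmp);  // sparse path: ordered conversion
      }
    }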


On Mon, Mar 28, 2011 at 10:04 PM, Timothy Potter <th...@gmail.com>wrote:

> Hi Jake,
>
> I've tracked down the bottleneck in the TransposeJob's reducer to the line
> where Mahout creates a new SequentialAccessSparseVector from a
> RandomAccessSparseVector after the while loop completes:
>
>      SequentialAccessSparseVector outVector = new
> SequentialAccessSparseVector(tmp);
>
> For high-frequency terms (some of which occur over ~1M times in my data),
> the code to create a SequentialAccessSparseVector from a
> RandomAccessSparseVector bogs down completely. Of course, I should probably
> get rid of the terms that occur this frequently and it seems I could
> continue to increase mapred.task.timeout but as you said, "transpose job is
> totally trivial" so there may be a bigger issue here that is worth digging
> into further. I think it has something to do with OrderedIntDoubleMapping,
> but haven't had enough time to devote to this issue to know for sure ...
> I'll keep at it though.
>
> Cheers,
> Tim
>
> On Mon, Mar 21, 2011 at 6:34 PM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> >
> >
> > On Sun, Mar 20, 2011 at 5:27 PM, Timothy Potter <thelabdude@gmail.com
> >wrote:
> >
> >> Hi Jake,
> >>
> >>
> >> I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3
> >> reducers
> >> per node and mapred.child.java.opts=-Xmx4096m. The map tasks completed
> >> within a few minutes but then all of my 24 reducers failed near the 70%
> >> mark
> >> with error "Task attempt_201103201840_0004_r_000023_0 failed to report
> >> status for 601 seconds. Killing!" The data node servers stayed at a
> >> healthy
> >> load avg below 4 and never paged ...
> >>
> >
> > Weird.   The transpose job is totally trivial, but it may be that the
> > vectors you
> > are producing are abnormally large.  In fact, if this came from text,
> this
> > is
> > probably the case: very common words (which aren't stop words) occur
> > in most of your documents, and so are very very large.  Might they be
> > larger than your memory? 6-million entries, each one, when represented
> > sparsely (sadly, in this case), have an int and a float, so still less
> than
> > 100MB,
> > so that's not it.
> >
> > BTW: you need to make sure you use the same number of reducers as when
> > you made your eigenvector matrix, as that's a requirement for a map-side
> > join.   That should probably be documented!
> >
> >
> >> So I increased the "mapred.task.timeout" Hadoop parameter to 20 minutes
> >> instead of the default 10 minutes and it failed again. The reduce code
> for
> >> the TransposeJob looks straight-forward, so I'm going to have to dig in
> >> deeper to figure out what's causing the problem.
> >>
> >
> > Yeah, if you can see where it's timing out, that would be great.  Not
> sure
> > why that task should ever take very long.
> >
> >   -jake
> >
> >
> >
> >> Cheers,
> >> Tim
> >>
> >> On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <ja...@gmail.com>
> >> wrote:
> >>
> >> > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <
> thelabdude@gmail.com
> >> >wrote:
> >> >
> >> >> Regarding Jake's comment: " ... you need to run the RowIdJob on these
> >> >> tfidf-vectors first ..."
> >> >>
> >> >> I did this and now have an m x n matrix T (m=6076937, n=20444). My
> SVD
> >> >> eigenvector matrix E is p x q (p=87, q=20444).
> >> >
> >> >
> >> > Ok, so to help you understand what's going on here, I'm going to go
> into
> >> > a little of the inner details of what's going on here.
> >> >
> >> > You are right, you have a matrix T, with 6,076,937 rows, and each row
> >> has
> >> > 20,444 columns (most of which are zero, and it's represented sparsely,
> >> but
> >> > still, they live in a vector space of dimension 20,444).  Similarly,
> >> you've
> >> > made an eigenvector matrix, which has 87 rows (ie 87 eigenvectors) and
> >> > each of these rows has exactly 20,444 columns (and most likely,
> they'll
> >> > all be nonzero, because eigenvectors have no reason to be sparse).
> >> >
> >> > In particular, T and E are represented as *lists of rows*, each row is
> a
> >> > vector of dimension 20,444.  T has six million of these rows, and E
> has
> >> > only 87.
> >> >
> >> >
> >> >> So to multiply these two
> >> >> matrices, I need to transpose E so that the number of columns in T
> >> equals
> >> >> the number of rows in E (i.e. E^T is q x p) the result of the
> >> matrixmult
> >> >> would give me an m x p matrix (m=6076937, p=87).
> >> >>
> >> >
> >> > You're exactly right that you want to multiply T by E^T, because you
> >> can't
> >> > compute T * E.
> >> >
> >> > The way it turns out in practice, computing the matrix product of two
> >> > matrices as a map-reduce job is efficiently done as a map-side join on
> >> > two row-based matrices with the same number of rows, and the columns
> >> > are the ones which are different.  In particular, if you take a matrix
> X
> >> > which
> >> > is represented as a set of numRowsX rows, each of which has numColsX,
> >> > and another matrix with numRowsY == numRowsX, each of which has
> >> > numColsY (!= numColsX), then by summing the outer-products of each
> >> > of the numRowsX pairs of vectors, you get a matrix of with numRowsZ ==
> >> > numColsX, and numColsZ == numColsY (if you instead take the reverse
> >> > outer product of the vector pairs, you can end up with the transpose
> of
> >> > this
> >> > final result, with numRowsZ == numColsY, and numColsZ == numColsX).
> >> >
> >> > Unfortunately, you have a pair of matrices which have different
> numbers
> >> > of rows, and the same number of columns, but you want a pair of
> matrices
> >> > with the same number of rows and (possibly) different numbers of
> >> columns.
> >> >
> >> >
> >> >> So I tried to run matrixmult with:
> >> >
> >> > matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444
> >> --numColsB
> >> >> 87 \
> >> >> --inputPathA
> >> >> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix
> \
> >> >> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244
> >> >
> >> >
> >> >> (--inputPathA points to the output of the rowid job)
> >> >>
> >> >> This results in:
> >> >> Exception in thread "main"
> org.apache.mahout.math.CardinalityException:
> >> >>
> >> >
> >> >
> >> >> In the code, I see the test that row counts must be identical for the
> >> two
> >> >>
> >> >> input matrices. Thus, it seems like the job requires me to transpose
> >> this
> >> >> large matrix, just to re-transpose it back to it's original form
> during
> >> >> the
> >> >> multiplication? Or have I missed something crucial again?
> >> >>
> >> >
> >> > You actually need to transpose the input matrix (T), and then re-run
> >> with
> >> > T^t
> >> > and E^t (the latter you apparently already have created).
> >> >
> >> > We should really rename the "matrixmultiply" job to be called
> >> > "transposematrixmultiply", because that's what it really does.
> >> >
> >> >   -jake
> >> >
> >>
> >
> >
>

Re: Need a little help with using SVD

Posted by Timothy Potter <th...@gmail.com>.
Hi Jake,

I've tracked down the bottleneck in the TransposeJob's reducer to the line
where Mahout creates a new SequentialAccessSparseVector from a
RandomAccessSparseVector after the while loop completes:

      SequentialAccessSparseVector outVector = new SequentialAccessSparseVector(tmp);

For high-frequency terms (some of which occur over ~1M times in my data),
the code to create a SequentialAccessSparseVector from a
RandomAccessSparseVector bogs down completely. Of course, I should probably
get rid of the terms that occur this frequently and it seems I could
continue to increase mapred.task.timeout but as you said, "transpose job is
totally trivial" so there may be a bigger issue here that is worth digging
into further. I think it has something to do with OrderedIntDoubleMapping,
but haven't had enough time to devote to this issue to know for sure ...
I'll keep at it though.
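A stand-alone way to see how that conversion scales with the number of
non-zeros (a sketch only - it assumes the Mahout 0.4 vector classes named
above, and the sizes are arbitrary; if the time roughly quadruples each time
nnz doubles, the conversion is quadratic):

    import java.util.Random;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.SequentialAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class ConversionScaling {
      public static void main(String[] args) {
        Random random = new Random(42);
        for (int nnz = 125000; nnz <= 500000; nnz *= 2) {
          // Build a term row like the ones hitting the timeout: random document indices.
          Vector tmp = new RandomAccessSparseVector(6076937);
          while (tmp.getNumNondefaultElements() < nnz) {
            tmp.set(random.nextInt(6076937), 1.0);
          }
          long start = System.nanoTime();
          Vector out = new SequentialAccessSparseVector(tmp);  // the line that bogs down
          System.out.printf("nnz=%7d  convert=%8.1f ms%n",
              out.getNumNondefaultElements(), (System.nanoTime() - start) / 1e6);
        }
      }
    }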

Cheers,
Tim

On Mon, Mar 21, 2011 at 6:34 PM, Jake Mannix <ja...@gmail.com> wrote:

>
>
> On Sun, Mar 20, 2011 at 5:27 PM, Timothy Potter <th...@gmail.com>wrote:
>
>> Hi Jake,
>>
>>
>> I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3
>> reducers
>> per node and mapred.child.java.opts=-Xmx4096m. The map tasks completed
>> within a few minutes but then all of my 24 reducers failed near the 70%
>> mark
>> with error "Task attempt_201103201840_0004_r_000023_0 failed to report
>> status for 601 seconds. Killing!" The data node servers stayed at a
>> healthy
>> load avg below 4 and never paged ...
>>
>
> Weird.   The transpose job is totally trivial, but it may be that the
> vectors you
> are producing are abnormally large.  In fact, if this came from text, this
> is
> probably the case: very common words (which aren't stop words) occur
> in most of your documents, and so are very very large.  Might they be
> larger than your memory? 6-million entries, each one, when represented
> sparsely (sadly, in this case), have an int and a float, so still less than
> 100MB,
> so that's not it.
>
> BTW: you need to make sure you use the same number of reducers as when
> you made your eigenvector matrix, as that's a requirement for a map-side
> join.   That should probably be documented!
>
>
>> So I increased the "mapred.task.timeout" Hadoop parameter to 20 minutes
>> instead of the default 10 minutes and it failed again. The reduce code for
>> the TransposeJob looks straight-forward, so I'm going to have to dig in
>> deeper to figure out what's causing the problem.
>>
>
> Yeah, if you can see where it's timing out, that would be great.  Not sure
> why that task should ever take very long.
>
>   -jake
>
>
>
>> Cheers,
>> Tim
>>
>> On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <ja...@gmail.com>
>> wrote:
>>
>> > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <thelabdude@gmail.com
>> >wrote:
>> >
>> >> Regarding Jake's comment: " ... you need to run the RowIdJob on these
>> >> tfidf-vectors first ..."
>> >>
>> >> I did this and now have an m x n matrix T (m=6076937, n=20444). My SVD
>> >> eigenvector matrix E is p x q (p=87, q=20444).
>> >
>> >
>> > Ok, so to help you understand what's going on here, I'm going to go into
>> > a little of the inner details of what's going on here.
>> >
>> > You are right, you have a matrix T, with 6,076,937 rows, and each row
>> has
>> > 20,444 columns (most of which are zero, and it's represented sparsely,
>> but
>> > still, they live in a vector space of dimension 20,444).  Similarly,
>> you've
>> > made an eigenvector matrix, which has 87 rows (ie 87 eigenvectors) and
>> > each of these rows has exactly 20,444 columns (and most likely, they'll
>> > all be nonzero, because eigenvectors have no reason to be sparse).
>> >
>> > In particular, T and E are represented as *lists of rows*, each row is a
>> > vector of dimension 20,444.  T has six million of these rows, and E has
>> > only 87.
>> >
>> >
>> >> So to multiply these two
>> >> matrices, I need to transpose E so that the number of columns in T
>> equals
>> >> the number of rows in E (i.e. E^T is q x p) the result of the
>> matrixmult
>> >> would give me an m x p matrix (m=6076937, p=87).
>> >>
>> >
>> > You're exactly right that you want to multiply T by E^T, because you
>> can't
>> > compute T * E.
>> >
>> > The way it turns out in practice, computing the matrix product of two
>> > matrices as a map-reduce job is efficiently done as a map-side join on
>> > two row-based matrices with the same number of rows, and the columns
>> > are the ones which are different.  In particular, if you take a matrix X
>> > which
>> > is represented as a set of numRowsX rows, each of which has numColsX,
>> > and another matrix with numRowsY == numRowsX, each of which has
>> > numColsY (!= numColsX), then by summing the outer-products of each
>> > of the numRowsX pairs of vectors, you get a matrix of with numRowsZ ==
>> > numColsX, and numColsZ == numColsY (if you instead take the reverse
>> > outer product of the vector pairs, you can end up with the transpose of
>> > this
>> > final result, with numRowsZ == numColsY, and numColsZ == numColsX).
>> >
>> > Unfortunately, you have a pair of matrices which have different numbers
>> > of rows, and the same number of columns, but you want a pair of matrices
>> > with the same number of rows and (possibly) different numbers of
>> columns.
>> >
>> >
>> >> So I tried to run matrixmult with:
>> >
>> > matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444
>> --numColsB
>> >> 87 \
>> >> --inputPathA
>> >> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \
>> >> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244
>> >
>> >
>> >> (--inputPathA points to the output of the rowid job)
>> >>
>> >> This results in:
>> >> Exception in thread "main" org.apache.mahout.math.CardinalityException:
>> >>
>> >
>> >
>> >> In the code, I see the test that row counts must be identical for the
>> two
>> >>
>> >> input matrices. Thus, it seems like the job requires me to transpose
>> this
>> >> large matrix, just to re-transpose it back to it's original form during
>> >> the
>> >> multiplication? Or have I missed something crucial again?
>> >>
>> >
>> > You actually need to transpose the input matrix (T), and then re-run
>> with
>> > T^t
>> > and E^t (the latter you apparently already have created).
>> >
>> > We should really rename the "matrixmultiply" job to be called
>> > "transposematrixmultiply", because that's what it really does.
>> >
>> >   -jake
>> >
>>
>
>

Re: Need a little help with using SVD

Posted by Jake Mannix <ja...@gmail.com>.
On Sun, Mar 20, 2011 at 5:27 PM, Timothy Potter <th...@gmail.com>wrote:

> Hi Jake,
>
> I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3
> reducers
> per node and mapred.child.java.opts=-Xmx4096m. The map tasks completed
> within a few minutes but then all of my 24 reducers failed near the 70%
> mark
> with error "Task attempt_201103201840_0004_r_000023_0 failed to report
> status for 601 seconds. Killing!" The data node servers stayed at a healthy
> load avg below 4 and never paged ...
>

Weird.   The transpose job is totally trivial, but it may be that the
vectors you are producing are abnormally large.  In fact, if this came from
text, this is probably the case: very common words (which aren't stop words)
occur in most of your documents, and so their rows in the transpose are very,
very large.  Might they be larger than your memory? Even at 6 million
entries, each entry, when represented sparsely (sadly, in this case), holds
an int and a float, so such a row is still less than 100MB - so that's not
it.
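(For concreteness: 6,076,937 entries at 8 bytes apiece - a 4-byte int index
plus a 4-byte float value - comes to roughly 49MB, or about 73MB if the
values are stored as 8-byte doubles; either way it fits comfortably in the
4GB child heap you configured.)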

BTW: you need to make sure you use the same number of reducers as when
you made your eigenvector matrix, as that's a requirement for a map-side
join.   That should probably be documented!


> So I increased the "mapred.task.timeout" Hadoop parameter to 20 minutes
> instead of the default 10 minutes and it failed again. The reduce code for
> the TransposeJob looks straight-forward, so I'm going to have to dig in
> deeper to figure out what's causing the problem.
>

Yeah, if you can see where it's timing out, that would be great.  Not sure
why that task should ever take very long.

  -jake



> Cheers,
> Tim
>
> On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <thelabdude@gmail.com
> >wrote:
> >
> >> Regarding Jake's comment: " ... you need to run the RowIdJob on these
> >> tfidf-vectors first ..."
> >>
> >> I did this and now have an m x n matrix T (m=6076937, n=20444). My SVD
> >> eigenvector matrix E is p x q (p=87, q=20444).
> >
> >
> > Ok, so to help you understand what's going on here, I'm going to go into
> > a little of the inner details of what's going on here.
> >
> > You are right, you have a matrix T, with 6,076,937 rows, and each row has
> > 20,444 columns (most of which are zero, and it's represented sparsely,
> but
> > still, they live in a vector space of dimension 20,444).  Similarly,
> you've
> > made an eigenvector matrix, which has 87 rows (ie 87 eigenvectors) and
> > each of these rows has exactly 20,444 columns (and most likely, they'll
> > all be nonzero, because eigenvectors have no reason to be sparse).
> >
> > In particular, T and E are represented as *lists of rows*, each row is a
> > vector of dimension 20,444.  T has six million of these rows, and E has
> > only 87.
> >
> >
> >> So to multiply these two
> >> matrices, I need to transpose E so that the number of columns in T
> equals
> >> the number of rows in E (i.e. E^T is q x p) the result of the matrixmult
> >> would give me an m x p matrix (m=6076937, p=87).
> >>
> >
> > You're exactly right that you want to multiply T by E^T, because you
> can't
> > compute T * E.
> >
> > The way it turns out in practice, computing the matrix product of two
> > matrices as a map-reduce job is efficiently done as a map-side join on
> > two row-based matrices with the same number of rows, and the columns
> > are the ones which are different.  In particular, if you take a matrix X
> > which
> > is represented as a set of numRowsX rows, each of which has numColsX,
> > and another matrix with numRowsY == numRowsX, each of which has
> > numColsY (!= numColsX), then by summing the outer-products of each
> > of the numRowsX pairs of vectors, you get a matrix of with numRowsZ ==
> > numColsX, and numColsZ == numColsY (if you instead take the reverse
> > outer product of the vector pairs, you can end up with the transpose of
> > this
> > final result, with numRowsZ == numColsY, and numColsZ == numColsX).
> >
> > Unfortunately, you have a pair of matrices which have different numbers
> > of rows, and the same number of columns, but you want a pair of matrices
> > with the same number of rows and (possibly) different numbers of columns.
> >
> >
> >> So I tried to run matrixmult with:
> >
> > matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444
> --numColsB
> >> 87 \
> >> --inputPathA
> >> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \
> >> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244
> >
> >
> >> (--inputPathA points to the output of the rowid job)
> >>
> >> This results in:
> >> Exception in thread "main" org.apache.mahout.math.CardinalityException:
> >>
> >
> >
> >> In the code, I see the test that row counts must be identical for the
> two
> >>
> >> input matrices. Thus, it seems like the job requires me to transpose
> this
> >> large matrix, just to re-transpose it back to it's original form during
> >> the
> >> multiplication? Or have I missed something crucial again?
> >>
> >
> > You actually need to transpose the input matrix (T), and then re-run with
> > T^t
> > and E^t (the latter you apparently already have created).
> >
> > We should really rename the "matrixmultiply" job to be called
> > "transposematrixmultiply", because that's what it really does.
> >
> >   -jake
> >
>

Re: Need a little help with using SVD

Posted by Danny Bickson <da...@gmail.com>.
Hi!
I got exactly the same error when running alternating least squares.
You need to set up the task timeout and the expiry-interval timeout.
Exact details are found in my blog:
http://bickson.blogspot.com/2011/03/tunning-hadoop-configuration-for-high.html
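For reference, the two settings can also be raised programmatically when a
job is driven from Java (a sketch only - property names as in stock Hadoop
0.20, values are examples; see the blog post above for the full tuning list,
or put the same properties in mapred-site.xml):

    import org.apache.hadoop.conf.Configuration;

    public class TimeoutTuning {
      // Returns a Configuration with the task timeout and the tasktracker expiry
      // interval raised from their 10-minute defaults to 30 minutes.
      public static Configuration withLongerTimeouts() {
        Configuration conf = new Configuration();
        conf.setLong("mapred.task.timeout", 30L * 60 * 1000);
        conf.setLong("mapred.tasktracker.expiry.interval", 30L * 60 * 1000);
        return conf;
      }
    }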

Best,

- Danny Bickson

On Sun, Mar 20, 2011 at 8:27 PM, Timothy Potter <th...@gmail.com>wrote:

> Hi Jake,
>
> Thank you for the detailed explanation; seems like a very clever way to
> distribute the matrix multiplication process.
>
> I tried running the transpose job on my T matrix using the following
> options:
>
> bin/mahout transpose -i
> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix
> --numRows 6076937 --numCols 20444
>
> I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3
> reducers
> per node and mapred.child.java.opts=-Xmx4096m. The map tasks completed
> within a few minutes but then all of my 24 reducers failed near the 70%
> mark
> with error "Task attempt_201103201840_0004_r_000023_0 failed to report
> status for 601 seconds. Killing!" The data node servers stayed at a healthy
> load avg below 4 and never paged ...
>
> So I increased the "mapred.task.timeout" Hadoop parameter to 20 minutes
> instead of the default 10 minutes and it failed again. The reduce code for
> the TransposeJob looks straight-forward, so I'm going to have to dig in
> deeper to figure out what's causing the problem.
>
> Cheers,
> Tim
>
> On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <thelabdude@gmail.com
> >wrote:
> >
> >> Regarding Jake's comment: " ... you need to run the RowIdJob on these
> >> tfidf-vectors first ..."
> >>
> >> I did this and now have an m x n matrix T (m=6076937, n=20444). My SVD
> >> eigenvector matrix E is p x q (p=87, q=20444).
> >
> >
> > Ok, so to help you understand what's going on here, I'm going to go into
> > a little of the inner details of what's going on here.
> >
> > You are right, you have a matrix T, with 6,076,937 rows, and each row has
> > 20,444 columns (most of which are zero, and it's represented sparsely,
> but
> > still, they live in a vector space of dimension 20,444).  Similarly,
> you've
> > made an eigenvector matrix, which has 87 rows (ie 87 eigenvectors) and
> > each of these rows has exactly 20,444 columns (and most likely, they'll
> > all be nonzero, because eigenvectors have no reason to be sparse).
> >
> > In particular, T and E are represented as *lists of rows*, each row is a
> > vector of dimension 20,444.  T has six million of these rows, and E has
> > only 87.
> >
> >
> >> So to multiply these two
> >> matrices, I need to transpose E so that the number of columns in T
> equals
> >> the number of rows in E (i.e. E^T is q x p) the result of the matrixmult
> >> would give me an m x p matrix (m=6076937, p=87).
> >>
> >
> > You're exactly right that you want to multiply T by E^T, because you
> can't
> > compute T * E.
> >
> > The way it turns out in practice, computing the matrix product of two
> > matrices as a map-reduce job is efficiently done as a map-side join on
> > two row-based matrices with the same number of rows, and the columns
> > are the ones which are different.  In particular, if you take a matrix X
> > which
> > is represented as a set of numRowsX rows, each of which has numColsX,
> > and another matrix with numRowsY == numRowsX, each of which has
> > numColsY (!= numColsX), then by summing the outer-products of each
> > of the numRowsX pairs of vectors, you get a matrix of with numRowsZ ==
> > numColsX, and numColsZ == numColsY (if you instead take the reverse
> > outer product of the vector pairs, you can end up with the transpose of
> > this
> > final result, with numRowsZ == numColsY, and numColsZ == numColsX).
> >
> > Unfortunately, you have a pair of matrices which have different numbers
> > of rows, and the same number of columns, but you want a pair of matrices
> > with the same number of rows and (possibly) different numbers of columns.
> >
> >
> >> So I tried to run matrixmult with:
> >
> > matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444
> --numColsB
> >> 87 \
> >> --inputPathA
> >> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \
> >> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244
> >
> >
> >> (--inputPathA points to the output of the rowid job)
> >>
> >> This results in:
> >> Exception in thread "main" org.apache.mahout.math.CardinalityException:
> >>
> >
> >
> >> In the code, I see the test that row counts must be identical for the
> two
> >>
> >> input matrices. Thus, it seems like the job requires me to transpose
> this
> >> large matrix, just to re-transpose it back to it's original form during
> >> the
> >> multiplication? Or have I missed something crucial again?
> >>
> >
> > You actually need to transpose the input matrix (T), and then re-run with
> > T^t
> > and E^t (the latter you apparently already have created).
> >
> > We should really rename the "matrixmultiply" job to be called
> > "transposematrixmultiply", because that's what it really does.
> >
> >   -jake
> >
>

Re: Need a little help with using SVD

Posted by Timothy Potter <th...@gmail.com>.
Hi Jake,

Thank you for the detailed explanation; seems like a very clever way to
distribute the matrix multiplication process.

I tried running the transpose job on my T matrix using the following
options:

bin/mahout transpose -i
/asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix
--numRows 6076937 --numCols 20444

I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3 reducers
per node and mapred.child.java.opts=-Xmx4096m. The map tasks completed
within a few minutes but then all of my 24 reducers failed near the 70% mark
with error "Task attempt_201103201840_0004_r_000023_0 failed to report
status for 601 seconds. Killing!" The data node servers stayed at a healthy
load avg below 4 and never paged ...

So I increased the "mapred.task.timeout" Hadoop parameter to 20 minutes
instead of the default 10 minutes and it failed again. The reduce code for
the TransposeJob looks straight-forward, so I'm going to have to dig in
deeper to figure out what's causing the problem.

Cheers,
Tim

On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <ja...@gmail.com> wrote:

> On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <th...@gmail.com>wrote:
>
>> Regarding Jake's comment: " ... you need to run the RowIdJob on these
>> tfidf-vectors first ..."
>>
>> I did this and now have an m x n matrix T (m=6076937, n=20444). My SVD
>> eigenvector matrix E is p x q (p=87, q=20444).
>
>
> Ok, so to help you understand what's going on here, I'm going to go into
> a little of the inner details of what's going on here.
>
> You are right, you have a matrix T, with 6,076,937 rows, and each row has
> 20,444 columns (most of which are zero, and it's represented sparsely, but
> still, they live in a vector space of dimension 20,444).  Similarly, you've
> made an eigenvector matrix, which has 87 rows (ie 87 eigenvectors) and
> each of these rows has exactly 20,444 columns (and most likely, they'll
> all be nonzero, because eigenvectors have no reason to be sparse).
>
> In particular, T and E are represented as *lists of rows*, each row is a
> vector of dimension 20,444.  T has six million of these rows, and E has
> only 87.
>
>
>> So to multiply these two
>> matrices, I need to transpose E so that the number of columns in T equals
>> the number of rows in E (i.e. E^T is q x p) the result of the matrixmult
>> would give me an m x p matrix (m=6076937, p=87).
>>
>
> You're exactly right that you want to multiply T by E^T, because you can't
> compute T * E.
>
> The way it turns out in practice, computing the matrix product of two
> matrices as a map-reduce job is efficiently done as a map-side join on
> two row-based matrices with the same number of rows, and the columns
> are the ones which are different.  In particular, if you take a matrix X
> which
> is represented as a set of numRowsX rows, each of which has numColsX,
> and another matrix with numRowsY == numRowsX, each of which has
> numColsY (!= numColsX), then by summing the outer-products of each
> of the numRowsX pairs of vectors, you get a matrix of with numRowsZ ==
> numColsX, and numColsZ == numColsY (if you instead take the reverse
> outer product of the vector pairs, you can end up with the transpose of
> this
> final result, with numRowsZ == numColsY, and numColsZ == numColsX).
>
> Unfortunately, you have a pair of matrices which have different numbers
> of rows, and the same number of columns, but you want a pair of matrices
> with the same number of rows and (possibly) different numbers of columns.
>
>
>> So I tried to run matrixmult with:
>
> matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444 --numColsB
>> 87 \
>> --inputPathA
>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \
>> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244
>
>
>> (--inputPathA points to the output of the rowid job)
>>
>> This results in:
>> Exception in thread "main" org.apache.mahout.math.CardinalityException:
>>
>
>
>> In the code, I see the test that row counts must be identical for the two
>>
>> input matrices. Thus, it seems like the job requires me to transpose this
>> large matrix, just to re-transpose it back to it's original form during
>> the
>> multiplication? Or have I missed something crucial again?
>>
>
> You actually need to transpose the input matrix (T), and then re-run with
> T^t
> and E^t (the latter you apparently already have created).
>
> We should really rename the "matrixmultiply" job to be called
> "transposematrixmultiply", because that's what it really does.
>
>   -jake
>

Re: Need a little help with using SVD

Posted by Jake Mannix <ja...@gmail.com>.
On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <th...@gmail.com>wrote:

> Regarding Jake's comment: " ... you need to run the RowIdJob on these
> tfidf-vectors first ..."
>
> I did this and now have an m x n matrix T (m=6076937, n=20444). My SVD
> eigenvector matrix E is p x q (p=87, q=20444).


Ok, so to help you understand what's going on here, I'm going to go into
a little of the inner details.

You are right, you have a matrix T, with 6,076,937 rows, and each row has
20,444 columns (most of which are zero, and it's represented sparsely, but
still, they live in a vector space of dimension 20,444).  Similarly, you've
made an eigenvector matrix, which has 87 rows (ie 87 eigenvectors) and
each of these rows has exactly 20,444 columns (and most likely, they'll
all be nonzero, because eigenvectors have no reason to be sparse).

In particular, T and E are represented as *lists of rows*, each row is a
vector of dimension 20,444.  T has six million of these rows, and E has
only 87.


> So to multiply these two
> matrices, I need to transpose E so that the number of columns in T equals
> the number of rows in E (i.e. E^T is q x p) the result of the matrixmult
> would give me an m x p matrix (m=6076937, p=87).
>

You're exactly right that you want to multiply T by E^T, because you can't
compute T * E.

The way it turns out in practice, computing the matrix product of two
matrices as a map-reduce job is efficiently done as a map-side join on
two row-based matrices with the same number of rows, and the columns
are the ones which are different.  In particular, if you take a matrix X
which is represented as a set of numRowsX rows, each of which has numColsX
columns, and another matrix Y with numRowsY == numRowsX rows, each of which
has numColsY columns (numColsY != numColsX), then by summing the outer
products of each of the numRowsX pairs of corresponding rows, you get a
matrix Z with numRowsZ == numColsX and numColsZ == numColsY (if you instead
take the reverse outer product of each pair, you end up with the transpose
of this final result, with numRowsZ == numColsY and numColsZ == numColsX).
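To make the outer-product formulation concrete, here is a tiny plain-Java
illustration (not Mahout code, just a sketch with made-up names) of
Z = X^t * Y computed as a sum of outer products over the shared rows:

    public class OuterProductSum {
      // Z = X^t * Y for two row matrices with the same number of rows:
      // numRowsZ == numColsX, numColsZ == numColsY.
      static double[][] transposeTimes(double[][] x, double[][] y) {
        int numColsX = x[0].length;
        int numColsY = y[0].length;
        double[][] z = new double[numColsX][numColsY];
        for (int row = 0; row < x.length; row++) {   // one outer product per shared row
          for (int i = 0; i < numColsX; i++) {
            for (int j = 0; j < numColsY; j++) {
              z[i][j] += x[row][i] * y[row][j];
            }
          }
        }
        return z;
      }

      public static void main(String[] args) {
        double[][] x = {{1, 2}, {3, 4}, {5, 6}};          // 3 rows, 2 columns
        double[][] y = {{1, 0, 1}, {0, 1, 0}, {1, 1, 1}}; // 3 rows, 3 columns
        double[][] z = transposeTimes(x, y);              // 2 x 3 result, i.e. X^t * Y
        for (double[] zRow : z) {
          System.out.println(java.util.Arrays.toString(zRow));
        }
      }
    }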

Unfortunately, you have a pair of matrices which have different numbers
of rows, and the same number of columns, but you want a pair of matrices
with the same number of rows and (possibly) different numbers of columns.


> So I tried to run matrixmult with:

matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444 --numColsB
> 87 \
> --inputPathA
> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \
> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244


> (--inputPathA points to the output of the rowid job)
>
> This results in:
> Exception in thread "main" org.apache.mahout.math.CardinalityException:
>


> In the code, I see the test that row counts must be identical for the two
> input matrices. Thus, it seems like the job requires me to transpose this
> large matrix, just to re-transpose it back to it's original form during the
> multiplication? Or have I missed something crucial again?
>

You actually need to transpose the input matrix (T), and then re-run with
T^t
and E^t (the latter you apparently already have created).
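Concretely, the fixed-up sequence would look something like this (the
transpose job picks its own transpose-NNN output directory, like the
transpose-244 directory above, so substitute the path it actually reports
for the T^t placeholder):

bin/mahout transpose -i \
    /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \
    --numRows 6076937 --numCols 20444

bin/mahout matrixmult --numRowsA 20444 --numColsA 6076937 \
    --inputPathA <transpose output of the tfidf matrix, i.e. T^t> \
    --numRowsB 20444 --numColsB 87 \
    --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244

which gives the 6076937 x 87 matrix you're after.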

We should really rename the "matrixmultiply" job to be called
"transposematrixmultiply", because that's what it really does.

  -jake

Re: Need a little help with using SVD

Posted by Timothy Potter <th...@gmail.com>.
Hi,

Just a quick status check on my progress to see if I'm on the right track. I
understand from Ted's comments that SVD may not be needed / useful in this
scenario, but I'd still like to finish going through the motions, if only as
a test of the SVD support in Mahout 0.4 on EC2.

Regarding Jake's comment: " ... you need to run the RowIdJob on these
tfidf-vectors first ..."

I did this and now have an m x n matrix T (m=6076937, n=20444). My SVD
eigenvector matrix E is p x q (p=87, q=20444). So to multiply these two
matrices, I need to transpose E so that the number of columns in T equals
the number of rows in E (i.e. E^T is q x p); the result of the matrixmult
would give me an m x p matrix (m=6076937, p=87).

So I tried to run matrixmult with:

matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444 --numColsB
87 \
--inputPathA
/asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \
--inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244

(--inputPathA points to the output of the rowid job)

This results in:
Exception in thread "main" org.apache.mahout.math.CardinalityException:
Required cardinality 6076937 but got 20444
    at
org.apache.mahout.math.hadoop.DistributedRowMatrix.times(DistributedRowMatrix.java:146)
    at
org.apache.mahout.math.hadoop.MatrixMultiplicationJob.run(MatrixMultiplicationJob.java:98)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at
org.apache.mahout.math.hadoop.MatrixMultiplicationJob.main(MatrixMultiplicationJob.java:67)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

In the code, I see the test that row counts must be identical for the two
input matrices. Thus, it seems like the job requires me to transpose this
large matrix, just to re-transpose it back to its original form during the
multiplication? Or have I missed something crucial again?

Cheers,
Tim

On Mon, Mar 14, 2011 at 1:09 PM, Jake Mannix <ja...@gmail.com> wrote:

>
> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <th...@gmail.com>wrote:
>
>> Looking for a little clarification with using SVD to reduce dimensions of
>> my
>> vectors for clustering ...
>>
>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors
>> with 20,444 dimensions. I successfully run Mahout SVD on the vectors
>> using:
>>
>> bin/mahout svd -i
>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>    -o /asf-mail-archives/mahout-0.4/svd \
>>    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>>
>> This produced 87 eigenvectors of size 20,444. I'm not clear as to why only
>> 87, but I'm assuming that has something to do with Lanczos???
>>
>
> Hi Timothy,
>
>   The LanczosSolver looks for 100 eigenvectors, but then does some cleanup
> after
> the fact: convergence issues and numeric overflow can cause some
> eigenvectors
> to show up twice - the last step in Mahout SVD is to remove these spurious
> eigenvectors (and also any which just don't appear to be "eigen" enough,
> i.e., they don't satisfy the eigenvector criterion with high enough
> fidelity).
>
>   If you really need more eigenvectors, you can try re-running with
> rank=150,
> and then take the top 100 out of however many you get out.
>
> So then I proceeded to transpose the SVD output using:
>>
>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444
>> --numRows 87
>>
>> Next, I tried to run transpose on my original vectors using:
>>
>> transpose -i
>> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors
>> --numCols 20444 --numRows 6076937
>>
>>
> So the problem with this is that the tfidf-vectors is a
> SequenceFile<Text,VectorWritable> - which is fine for input into
> DistributedLanczosSolver (which just needs <Writable,VectorWritable>
> pairs), but not so fine for being really considered a "matrix" - you need
> to run the RowIdJob on these tfidf-vectors first.  This will normalize your
> SequenceFile<Text,VectorWritable> into a
> SequenceFile<IntWritable,VectorWritable> and a
> SequenceFile<IntWritable,Text> (where the original one is the join of
> these new ones on the new int key).
>
> Hope that helps.
>
>   -jake
>

Re: Need a little help with using SVD

Posted by Jake Mannix <ja...@gmail.com>.
On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <th...@gmail.com>wrote:

> Looking for a little clarification with using SVD to reduce dimensions of
> my
> vectors for clustering ...
>
> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors
> with 20,444 dimensions. I successfully run Mahout SVD on the vectors using:
>
> bin/mahout svd -i
> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>    -o /asf-mail-archives/mahout-0.4/svd \
>    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>
> This produced 87 eigenvectors of size 20,444. I'm not clear as to why only
> 87, but I'm assuming that has something to do with Lanczos???
>

Hi Timothy,

  The LanczosSolver looks for 100 eigenvectors, but then does some cleanup
after
the fact: convergence issues and numeric overflow can cause some
eigenvectors
to show up twice - the last step in Mahout SVD is to remove these spurious
eigenvectors (and also any which just don't appear to be "eigen" enough,
i.e., they don't satisfy the eigenvector criterion with high enough fidelity).

  If you really need more eigenvectors, you can try re-running with
rank=150,
and then take the top 100 out of however many you get out.
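
For example (same input as before; the output path here is just
illustrative):

bin/mahout svd -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
    -o /asf-mail-archives/mahout-0.4/svd-rank150 \
    --rank 150 --numCols 20444 --numRows 6076937 --cleansvd true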

So then I proceeded to transpose the SVD output using:
>
> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444
> --numRows 87
>
> Next, I tried to run transpose on my original vectors using:
>
> transpose -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors
> --numCols 20444 --numRows 6076937
>
>
So the problem with this is that the tfidf-vectors is a
SequenceFile<Text,VectorWritable> - which is fine for input into
DistributedLanczosSolver (which just needs <Writable,VectorWritable> pairs),
but not so fine for being really considered a "matrix" - you need to run the
RowIdJob on these tfidf-vectors first.  This will normalize your
SequenceFile<Text,VectorWritable> into a
SequenceFile<IntWritable,VectorWritable> and a
SequenceFile<IntWritable,Text> (where the original one is the join of
these new ones on the new int key).
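
If it helps, the rowid step looks roughly like the following; the option
syntax has moved around between Mahout versions (newer builds take -i/-o,
older ones read -Dmapred.input.dir/-Dmapred.output.dir), so double-check it
against your 0.4 install, and take the output path as illustrative:

bin/mahout rowid \
    -Dmapred.input.dir=/asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
    -Dmapred.output.dir=/asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix

The output directory then contains the matrix part (the
SequenceFile<IntWritable,VectorWritable>), which is what you feed to
matrixmult, alongside the int-to-Text document index.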

Hope that helps.

  -jake