Posted to user@mahout.apache.org by Dan Brickley <da...@danbri.org> on 2011/10/07 17:26:28 UTC

'bin/mahout rowid' failing

Running from recent trunk, I can't get any sign of life from the
'rowid' command / job: a NullPointerException, even when just asking
for --help. Details below.

Also I tried running this more directly as 'mahout
org.apache.mahout.utils.vectors.RowIdJob  --input
sparse3/tfidf-vectors/part-r-00000  --output matrixified/' but no
success there either.

I'm trying to get my book code sparse vectors into a form that can be
usefully SVD'd, now that I have made some successful / plausible
clusters using those vectors. I think I need first to transpose them
so my columns correspond to records/books not their subject codes, but
the transpose job complained with type errors, and searching on those
led me to discover the 'rowid' task, which I believe I need to use
before I can transpose my matrix. So I seem to be stuck. Is rowid the
thing to be using here?

Dan

TellyClub:bin danbri$ ./mahout rowid --help
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/Users/danbri/working/hadoop/hadoop-0.20.2
HADOOP_CONF_DIR=/Users/danbri/working/hadoop/hadoop-0.20.2/conf
MAHOUT-JOB: /Users/danbri/working/mahout/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
^CTellyClub:bin danbri$ MAHOUT_LOCAL=true ./mahout rowid --help
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/Users/bandri/working/mahout/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/Users/bandri/working/mahout/trunk/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/Users/bandri/working/mahout/trunk/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Exception in thread "main" java.lang.NullPointerException
	at org.apache.hadoop.fs.Path.<init>(Path.java:61)
	at org.apache.hadoop.fs.Path.<init>(Path.java:50)
	at org.apache.mahout.utils.vectors.RowIdJob.run(RowIdJob.java:49)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
	at org.apache.mahout.utils.vectors.RowIdJob.main(RowIdJob.java:89)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
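
(Looking at the trace myself: RowIdJob.run at line 49 builds a Path, and
my guess -- entirely unverified, and all the names below are only
illustrative -- is that asking for --help leaves the input option unset,
so a null string reaches the Path constructor. A standalone sketch of
that suspected control flow:)

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: mimics the suspected control flow in
// RowIdJob.run(), not the actual Mahout code.
public class RowIdHelpSketch {

  // Stand-in for Mahout-style argument parsing: returns null when --help
  // is requested, so no option values are available afterwards.
  public static Map<String, String> parseArguments(String[] args) {
    Map<String, String> opts = new HashMap<>();
    for (int i = 0; i < args.length; i++) {
      if ("--help".equals(args[i])) {
        return null; // help text printed; nothing parsed
      }
      if (args[i].startsWith("--") && i + 1 < args.length) {
        opts.put(args[i], args[++i]);
      }
    }
    return opts;
  }

  public static int run(String[] args) {
    Map<String, String> parsed = parseArguments(args);
    if (parsed == null) {
      return -1; // the guard that appears to be missing before 'new Path(...)'
    }
    String input = parsed.get("--input");
    // without the guard above, 'new Path(input)' would see null here
    return input == null ? 1 : 0;
  }

  public static void main(String[] args) {
    System.out.println(run(new String[] {"--help"}));
    System.out.println(run(new String[] {"--input", "x", "--output", "y"}));
  }
}
```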

Re: 'bin/mahout rowid' failing

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Oct 8, 2011 at 4:11 PM, Dan Brickley <da...@danbri.org> wrote:

> > Also, while you are at it, I think that the code in MAHOUT-792 might be
> > able to do these decompositions at your scale much, much faster since
> > they use an in-memory algorithm on a single machine to avoid all the
> > kerfuffle of running a map-reduce.
>
> These tests are with just 100k entries. The full collection is
> somewhat over 12 or so million book records, which I'd assumed to be
> Hadoop territory.


It is Hadoop territory, but it may be the territory of lighter-weight
methods as well.  There is a large region of overlap where both kinds of
method work reasonably well.


> I tried loading a MySQL dump of that on my laptop
> but gave up after a couple weeks :) So thinking is to get a feel for
> things with 100k then have a go with the whole lot, at which point
> map-reduce should earn its keep. FWIW the running time for Lanczos on
> MAHOUT_LOCAL mode MacBook Pro was reasonably painless. But I'll take a
> look at MAHOUT-792 for sure.
>

Cool.  Please give me feedback.  A task I have on the list for after
committing 792 is to build a threaded version of 792 that uses out-of-core
methods.  It should be competitive with, or even much better than, Hadoop
into the range you are working in.  But 792 comes first.

Re: 'bin/mahout rowid' failing

Posted by Dan Brickley <da...@danbri.org>.
On 8 October 2011 23:58, Ted Dunning <te...@gmail.com> wrote:
> On Sat, Oct 8, 2011 at 12:43 PM, Dan Brickley <da...@danbri.org> wrote:
>
>> ...
>> ...and I get as expected, a few less than 100 due to the cleaning (88).
>> Each of these has 27683 values, which is the number of topic codes in my data.
>>
>> I'm reading this (correct me if I have this backwards) as if my topics are
>> now data points positioned in a new compressed version of a 'book space'.
>> What I was after was instead 100000 books in a new lower-dimensioned 'topic
>> space' (can I say this as: I want left singular vectors but I'm getting
>> right singular vectors?). Hence the attempt to transpose and rerun Lanczos;
>> I thought this the conceptually simplest if not most efficient way to get
>> there. I understand there are other routes but expected this one to work.
>
> I don't know the options to Lanczos as well as I should.  Your idea that you
> are getting right vectors only is correct and the idea to transpose in order
> to get the left vectors is sound.

Thanks, Ted.

> BUT
>
> I think that there is an option to get the left vectors as well as the right ones.

That would be handy.

bin/mahout svd --help gives no clue of such an option.

When I asked about this a while back (having naively expected 3
matrices back per textbook SVD) your response was

http://www.mail-archive.com/user@mahout.apache.org/msg03144.html
http://www.mail-archive.com/user@mahout.apache.org/msg03149.html
"Generally the SVD in these sorts of situations does not return the
entire set of three matrices.  Instead either the left or right (but
usually the right) eigenvectors premultiplied by the diagonal or the
square root of the diagonal element." and "You can multiply the
original matrix by the transpose of the available eigenvectors and the
inverse of the eigenvalues to get the missing eigenvectors."

(so I figured transposing the input and re-running was much safer/simpler
than trying to figure out how to take the inverse of the eigenvalues, at
least until I find my feet. But if there's a simple set of
'bin/mahout' calls for this, it would be great to document them in the
Wiki...)
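
Spelling that quoted recipe out in symbols (my own notation, so
double-check before relying on it; I'm assuming the available
eigenvectors are the rows of V, hence the 'transpose' in the quote):
with the thin SVD

```latex
A = U \Sigma V^{T}
\;\Rightarrow\; A V = U \Sigma
\;\Rightarrow\; U = A V \Sigma^{-1}
```

i.e. multiply the original matrix by the (transposed) eigenvector matrix
and the inverse of the singular values to recover the missing side.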

> Also, while you are at it, I think that the code in MAHOUT-792 might be able
> to do these decompositions at your scale much, much faster since they use an
> in-memory algorithm on a single machine to avoid all the kerfuffle of
> running a map-reduce.

These tests are with just 100k entries. The full collection is
somewhat over 12 or so million book records, which I'd assumed to be
Hadoop territory. I tried loading a MySQL dump of that on my laptop
but gave up after a couple weeks :) So thinking is to get a feel for
things with 100k then have a go with the whole lot, at which point
map-reduce should earn its keep. FWIW the running time for Lanczos on
MAHOUT_LOCAL mode MacBook Pro was reasonably painless. But I'll take a
look at MAHOUT-792 for sure.


Back on the original theme: could someone sanity-check this for me, and try
running 'MAHOUT_LOCAL=true mahout rowid --help' on a recent trunk
build? Since I don't believe my installation should be weird, I'd love
some confirmation of whether this works for others.

cheers,

Dan

Re: 'bin/mahout rowid' failing

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Oct 8, 2011 at 12:43 PM, Dan Brickley <da...@danbri.org> wrote:

> ...
> ...and I get as expected, a few less than 100 due to the cleaning (88).
> Each of these has 27683 values, which is the number of topic codes in my data.
>
> I'm reading this (correct me if I have this backwards) as if my topics are
> now data points positioned in a new compressed version of a 'book space'.
> What I was after was instead 100000 books in a new lower-dimensioned 'topic
> space' (can I say this as: I want left singular vectors but I'm getting
> right singular vectors?). Hence the attempt to transpose and rerun Lanczos;
> I thought this the conceptually simplest if not most efficient way to get
> there. I understand there are other routes but expected this one to work.
>

I don't know the options to Lanczos as well as I should.  Your idea that you
are getting right vectors only is correct and the idea to transpose in order
to get the left vectors is sound.

BUT

I think that there is an option to get the left vectors as well as the right
ones.

Also, while you are at it, I think that the code in MAHOUT-792 might be able
to do these decompositions at your scale much, much faster since they use an
in-memory algorithm on a single machine to avoid all the kerfuffle of
running a map-reduce.

'bin/mahout rowid' failing

Posted by Dan Brickley <da...@danbri.org>.
>> I'm trying to get my book code sparse vectors into a form that can be
>> usefully SVD'd, now that I have made some successful / plausible
>> clusters using those vectors. I think I need first to transpose them
>> so my columns correspond to records/books not their subject codes, but
>> the transpose job complained with type errors, and searching on those
>> led me to discover the 'rowid' task, which I believe I need to use
>> before I can transpose my matrix. So I seem to be stuck. Is rowid the
>> thing to be using here?

On 7 October 2011 17:36, Ted Dunning <te...@gmail.com> wrote:
> Actually, if you are clustering books, the books should be rows.

So for the clustering, they were. And simple kmeans was quite rewarding: I
asked for 100000 books to be put into 1000 clusters, and the
clusterdump summaries show very plausible packages of similar terms. Per the
previous thread, I chose to (ab)use the seq2sparse utility and treat my
phrase-based concept codes as words, by regex'ing them into
underscore_based_atoms:

Examples -

       Top Terms:
               qing_dynasties_1368                     => 9.295866330464682
               1912                                    => 5.948517057630751
               painting_chinese_ming                   => 3.3054930369059243
               1912_exhibitions                        => 2.6730045389246055
               art_chinese_ming                        => 2.2829231686062283
               wood                                    => 1.8342399243955259
               engraving_chinese_ming                  => 1.571796558521412
               yuan_dynasties_960                      => 1.366419615568938
               1368                                    => 1.0365213818020291
               porcelain_chinese_ming                  => 0.9943387420089157

       Top Terms:
               consciousness_physiology                => 11.120451927185059
               buddhism_psychology                     => 9.916479110717773
               religion_and_medicine                   => 9.08357048034668
               neurosciences                           => 8.960968017578125
               buddhism                                => 8.34786319732666

       Top Terms:
               human_evolution                         => 7.849616527557373
               social_evolution                        => 1.6036341407082297
               sociobiology                            => 1.037030653520064
               biological_evolution                    => 1.006914080995502
               fossil_hominids                         => 0.8849234147505327
               language_and_languages_origin           => 0.7156421198989406
               primates_evolution                      => 0.6145225871693004
               human_behavior                          => 0.5364708177971117
               intellect                               => 0.5072070613051906
               human_biology                           => 0.4739684191617099
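
(For what it's worth, the underscore atoms above came from a transform
roughly along these lines -- a reconstruction of the idea, not my actual
script, and `toAtom` is just an illustrative name:)

```java
// Sketch: collapse a phrase-based subject code into a single
// underscore_based_atom so seq2sparse treats it as one "word".
public class SubjectAtoms {

  public static String toAtom(String subjectCode) {
    return subjectCode
        .toLowerCase()
        .replaceAll("[^a-z0-9]+", "_") // any run of punctuation/space -> _
        .replaceAll("^_+|_+$", "");    // trim leading/trailing underscores
  }

  public static void main(String[] args) {
    // e.g. a library-style subject heading collapses to one token
    System.out.println(toAtom("Buddhism--Psychology"));
    System.out.println(toAtom("Religion and medicine"));
  }
}
```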


(this is getting off the topic of the original Subject: here, but bear with
me, maybe it's useful Mahout semi-newbie usability feedback?)

...so, flushed with pseudo-success, I thought it was time to have another
look at SVD. By this time I'd got in the habit of poking at Mahout's
previously mysterious binary files using the dump utilities, which helped
make things a bit less confusing.

So, I do svd against the same representation that was kmeans'd above:

mahout svd --cleansvd 1 --rank 100 --input
sparse3/tfidf-vectors/part-r-00000 --output svdout/  --numRows 100000
--numCols 27684

...then cleaned them (I guess '1' doesn't work for --cleansvd; the help text
doesn't say), then ran 'mahout seqdumper --seqFile cleanEigenvectors'

...and I get as expected, a few less than 100 due to the cleaning (88). Each
of these has 27683 values, which is the number of topic codes in my data.

I'm reading this (correct me if I have this backwards) as if my topics are
now data points positioned in a new compressed version of a 'book space'.
What I was after was instead 100000 books in a new lower-dimensioned 'topic
space' (can I say this as: I want left singular vectors but I'm getting
right singular vectors?). Hence the attempt to transpose and rerun Lanczos;
I thought this the conceptually simplest if not most efficient way to get
there. I understand there are other routes but expected this one to work.
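
For the record, the dimension bookkeeping behind that transpose detour
(my own reasoning, so treat with caution):

```latex
% A is books x topics:
A \in \mathbb{R}^{100000 \times 27684}, \qquad A = U \Sigma V^{T}
% Lanczos hands back rows of V: topic-space vectors of length ~27684,
% matching the 27683-value eigenvectors seen above (right singular vectors).
% Transposing swaps the roles:
A^{T} = V \Sigma U^{T}
% so decomposing A^T should return U instead: each of the 100000 books
% as a point in the reduced 'topic space' (left singular vectors).
```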

Is 'bin/mahout rowid' failing only for me?

Dan

Re: 'bin/mahout rowid' failing

Posted by Ted Dunning <te...@gmail.com>.
Actually, if you are clustering books, the books should be rows.

On Fri, Oct 7, 2011 at 8:26 AM, Dan Brickley <da...@danbri.org> wrote:

> Running from recent trunk, I can't get any sign of life from the
> 'rowid' command / job: a NullPointerException, even when just asking
> for --help. Details below.
>
> Also I tried running this more directly as 'mahout
> org.apache.mahout.utils.vectors.RowIdJob  --input
> sparse3/tfidf-vectors/part-r-00000  --output matrixified/' but no
> success there either.
>
> I'm trying to get my book code sparse vectors into a form that can be
> usefully SVD'd, now that I have made some successful / plausible
> clusters using those vectors. I think I need first to transpose them
> so my columns correspond to records/books not their subject codes, but
> the transpose job complained with type errors, and searching on those
> led me to discover the 'rowid' task, which I believe I need to use
> before I can transpose my matrix. So I seem to be stuck. Is rowid the
> thing to be using here?
>
> Dan
>
> TellyClub:bin danbri$ ./mahout rowid --help
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using
> HADOOP_HOME=/Users/danbri/working/hadoop/hadoop-0.20.2
> HADOOP_CONF_DIR=/Users/danbri/working/hadoop/hadoop-0.20.2/conf
> MAHOUT-JOB:
> /Users/danbri/working/mahout/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
> ^CTellyClub:bin danbri$ MAHOUT_LOCAL=true ./mahout rowid --help
> MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
> MAHOUT_LOCAL is set, running locally
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
>
> [jar:file:/Users/bandri/working/mahout/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
>
> [jar:file:/Users/bandri/working/mahout/trunk/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
>
> [jar:file:/Users/bandri/working/mahout/trunk/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> Exception in thread "main" java.lang.NullPointerException
>        at org.apache.hadoop.fs.Path.<init>(Path.java:61)
>        at org.apache.hadoop.fs.Path.<init>(Path.java:50)
>        at org.apache.mahout.utils.vectors.RowIdJob.run(RowIdJob.java:49)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>        at org.apache.mahout.utils.vectors.RowIdJob.main(RowIdJob.java:89)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>        at
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>