Posted to user@mahout.apache.org by Mat Kelcey <ma...@gmail.com> on 2011/04/17 08:06:34 UTC

strange results running lda against westbury corpus

hi all,

i'm kicking the tyres of lda by running it against the 2009 portion of
the westbury usenet corpus http://bit.ly/eUejPa

here's what i'm doing, based heavily on the build-reuters example

1) download the 2009 section of the corpus to hdfs 'corpus.raw'
it's about 4.5e6 posts across 880e6 lines in 50 bzipped files

2) pack into sequence files where each key is 0 and each value is a
single usenet post
hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
 -D mapred.reduce.tasks=0 \
 -input corpus.raw \
 -output corpus.seq \
 -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
 -mapper 'ruby one_article_per_line.rb' \
 -file one_article_per_line.rb

( the one_article_per_line.rb script can be seen at
https://gist.github.com/923435 )
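
(for what it's worth, the same packing can be done without streaming by
writing the SequenceFile directly through the hadoop api. a minimal,
untested java sketch; the class name, output path and loadPosts() helper
are all made up, the latter standing in for however the raw dump gets
split into posts)

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// sketch: write usenet posts into a SequenceFile the same way the
// streaming job does -- constant key "0", one post per value
public class PackPosts {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("corpus.seq/part-00000"); // assumed path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    Text key = new Text("0");
    Text value = new Text();
    try {
      for (String post : loadPosts()) { // hypothetical helper
        value.set(post);
        writer.append(key, value);
      }
    } finally {
      writer.close();
    }
  }

  // stand-in for the real post-splitting logic (one_article_per_line.rb)
  private static Iterable<String> loadPosts() {
    return Arrays.asList("first post ...", "second post ...");
  }
}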

3) convert to sparse sequence format
./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq -nr 45

4) check number of tokens in dictionary
hadoop fs -text corpus.seq-sparse/dictionary.file-0 | wc -l
1654229
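
(since -v gets derived from this count in step 5, it may be worth
reading the dictionary programmatically too; as far as i can tell
seq2sparse writes it as a SequenceFile of Text term -> IntWritable id,
so an untested sketch like this counts the entries and finds the max
id, which is what -v really needs to exceed)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// count dictionary entries and find the largest term id
public class DictStats {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path dict = new Path("corpus.seq-sparse/dictionary.file-0");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, dict, conf);
    Text term = new Text();
    IntWritable id = new IntWritable();
    long count = 0;
    int maxId = -1;
    while (reader.next(term, id)) {
      count++;
      maxId = Math.max(maxId, id.get());
    }
    reader.close();
    System.out.println(count + " terms, max id " + maxId);
  }
}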

5) run lda using the number of terms in the dictionary (plus a bit) as the number of terms
./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10 \
 -v 1700000 -ow -x 100
(converges after only 4 iterations)

6) dump the topics
./bin/mahout ldatopics -i corpus-lda/state-4 \
 -d corpus.seq-sparse/dictionary.file-0 -dt sequencefile

the topics i end up with are pretty much all the same (some crazy rants)
topic0: do our from have you like who i murder he would war zionist alex nazi
topic1: god jews death what all war murder you know can america zionist our
topic2: american alex murder he all like have i our us against justice america death
topic3: alex your our i all murder 911 against who innocent can humanity have what
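
(since those look suspect, one way to rule out ldatopics itself is to
eyeball the raw state. if i understand the layout right, each state-N
dir holds SequenceFiles keyed by an IntPairWritable of (topic, term id)
with a DoubleWritable log probability; rough untested java sketch,
part filename assumed)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.common.IntPairWritable;

// peek at the first few (topic, term id) -> log prob entries
public class DumpState {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path state = new Path("corpus-lda/state-4/part-00000"); // assumed name
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, state, conf);
    IntPairWritable topicTerm = new IntPairWritable();
    DoubleWritable logProb = new DoubleWritable();
    int shown = 0;
    while (reader.next(topicTerm, logProb) && shown++ < 20) {
      System.out.printf("topic %d term %d -> %f%n",
          topicTerm.getFirst(), topicTerm.getSecond(), logProb.get());
    }
    reader.close();
  }
}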

if i run again to convergence from step 3 with a slightly different number of reducers

./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq \
 -nr 40 # will this give different results?
./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10 \
 -v 1700000 -ow -x 100
./bin/mahout ldatopics -i corpus-lda/state-5 \
 -d corpus.seq-sparse/dictionary.file-0 -dt sequencefile

i get the topics
topic0: us what cost budgetary costs war total iraq so we do budget comes would
topic1: account loss had were cost iraq trillion have should costs than execution do
topic2: effort have item what cost war execution total us too also iraq billion difficult
topic3: income what have trillion iraq execution costs we were victory up budgetary

and if i try yet another number of reducers i get yet another result
(though it takes a lot longer to converge)

./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq -nr 41
./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10 \
 -v 1700000 -ow -x 100
./bin/mahout ldatopics -i corpus-lda/state-13 \
 -d corpus.seq-sparse/dictionary.file-0 -dt sequencefile

topic0: sex nude mature free men women sexy beautiful
topic1: nude sexy free men pics videos photos naked hot
topic2: videos naked asian pictures women free beautiful nude
topic3: sexy older hot naked mature pictures women photos

i expected each topic to be different & also expected that changing
the number of reducers would have no impact on what topics were
found (?)

is my packing of the sequence file wrong for lda? i was following the
reuters example of an entire email as a single value in the sequence
file.

is my number of topics, in this case 10, reasonable?

is my approach of using the number of terms in the
dictionary as the -v param to lda correct? (there is only one
dictionary.file)

finally here's the contents of the seq-sparse directory; not sure if
the file sizes suggest anything. the contents of the files look sane
https://gist.github.com/4eb5d5a3a90a064dd612

any thoughts most welcome; i'm happy to rerun with whatever
suggestions people might have

cheers!
mat

Re: strange results running lda against westbury corpus

Posted by Mat Kelcey <ma...@gmail.com>.
> I don't see anything wrong offhand.  You might look at MAHOUT-399.  I think we are trying to review how LDA performs at the moment.  From what I understand, you aren't guaranteed the same results each time.  (I wonder if there is a way to at least provide some sort of seed value so that one can reproduce a set of results.)
>
> At any rate, it's good that you put up detailed instructions of what you did, so that we can compare them.

thanks. i might start on a smaller set with clear distinct topics to
make sure my steps are sane and then build up. i'll let you know how i
go.
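
something like this is what i have in mind for the small set (untested
java sketch, toy data, names made up): two obviously distinct
vocabularies, so if lda can't pull them apart the pipeline itself is
suspect.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// build a tiny two-topic corpus to sanity-check the seq2sparse/lda steps
public class TinyCorpus {
  public static void main(String[] args) throws IOException {
    String[] docs = {
      "cats dogs pets kittens puppies vet fur",
      "kittens cats fur puppies dogs pets vet",
      "stocks bonds market trading shares broker",
      "market shares trading broker stocks bonds",
    };
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("corpus-small.seq/part-00000"),
        Text.class, Text.class);
    try {
      for (int i = 0; i < docs.length; i++) {
        writer.append(new Text("doc" + i), new Text(docs[i]));
      }
    } finally {
      writer.close();
    }
  }
}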

Re: strange results running lda against westbury corpus

Posted by Grant Ingersoll <gs...@apache.org>.
On Apr 17, 2011, at 8:06 AM, Mat Kelcey wrote:

> [snip]
>
> is my packing of the sequence file wrong for lda? i was following the
> reuters example of an entire email as a single value in the sequence
> file.

I don't see anything wrong offhand.  You might look at MAHOUT-399.  I think we are trying to review how LDA performs at the moment.  From what I understand, you aren't guaranteed the same results each time.  (I wonder if there is a way to at least provide some sort of seed value so that one can reproduce a set of results.)

At any rate, it's good that you put up detailed instructions of what you did, so that we can compare them.
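
To illustrate the seed point (this is not Mahout's actual code path,
just the general shape): random initialization is only reproducible if
the generator is pinned down.

import java.util.Arrays;
import java.util.Random;

// sketch: seeded initialization is identical across runs; unseeded is
// not, which is why repeated LDA runs can land in different local optima
public class SeedDemo {
  static double[] init(Random rng, int numTerms) {
    double[] counts = new double[numTerms];
    for (int i = 0; i < numTerms; i++) {
      counts[i] = rng.nextDouble(); // stand-in for initial topic/word noise
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.equals(init(new Random(42L), 5),
                                     init(new Random(42L), 5))); // true
    System.out.println(Arrays.equals(init(new Random(), 5),
                                     init(new Random(), 5))); // almost surely false
  }
}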


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search