Posted to user@mahout.apache.org by Diego Ceccarelli <di...@gmail.com> on 2012/10/28 22:21:17 UTC

Using LDA in Mahout 0.0.7

Dear all,

I'm trying to use the LDA framework in Mahout and I'm running into
some trouble.
I read these tutorials [1,2] and decided to apply LDA to a collection of
1M tweets to see how it works. I indexed them with Lucene as suggested
in [2], but then discovered that this is no longer supported in the latest
version, and I had to use a sequence file.
I saw the 'seqdirectory' util in [2], but it's impractical to create one million files,
one per tweet. So I wrote a small Java app that takes a file where each line
is a document and creates a <Text,Text> sequence file containing the id (line number)
and the tweet.
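In case it's useful to anyone, the converter is only a few lines on top of Hadoop's SequenceFile API. This is a rough sketch of what I did (the class name and argument handling are mine):

```java
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TweetsToSequenceFile {
  public static void main(String[] args) throws Exception {
    // args[0] = plain text file, one tweet per line
    // args[1] = output SequenceFile path
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, Text.class);
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    try {
      String line;
      long lineNo = 0;
      while ((line = in.readLine()) != null) {
        // key = line number (used as document id), value = the tweet text
        writer.append(new Text(Long.toString(lineNo++)), new Text(line));
      }
    } finally {
      in.close();
      writer.close();
    }
  }
}
```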
Then I used the seq2sparse util:

./bin/mahout seq2sparse -i ../lda-hello-world/tweet-sequence-file -o /tmp/vector -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow

and created the vectors (it succeeded without problems).

Now, I discovered that lda is now called cvb (why did you change the name? It's
a bit confusing...), so I tried to run the command, but I got this error:
 
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
(full stack trace here [3])

I also tried the local version:

./bin/mahout cvb0_local -i /tmp/vector/tf-vectors   -d /tmp/vector/dictionary.file-0 --numTopics 100 --docOutputFile /tmp/out --topicOutputFile /tmp/topic

(why are the parameter names different?)
But I got a similar error:
Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
(full stack trace here [4])

Where am I going wrong? Could you please help me?
Thanks 
Diego

[1] https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
[2] https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
[3] http://pastebin.com/nV3T74fe
[4] http://pastebin.com/JH1xQHuC

Re: Using LDA in Mahout 0.0.7

Posted by vineeth <vi...@gmail.com>.
Hello Dan,

Thank you for this reference. I was unable to get Mahout
0.0.7 to run LDA, so I downgraded to 0.5 and it worked.
Maybe I should try this.

Vineeth
On 12-10-29 02:02 PM, Diego Ceccarelli wrote:
> Thanks Dan, that solved it.
>
> On Sun, Oct 28, 2012 at 10:40 PM, DAN HELM <da...@verizon.net> wrote:
>> Hi Diego,
>> A number of us had the same issue when first working with the new CVB
>> algorithm. The vector keys for CVB need to be Integers. You can use the
>> rowid utility to convert the output from seq2sparse to the form needed by
>> CVB, e.g.,
>> http://comments.gmane.org/gmane.comp.apache.mahout.user/13112
>> Dan


Re: Using LDA in Mahout 0.0.7

Posted by Diego Ceccarelli <di...@gmail.com>.
Thanks Dan, that solved it.

On Sun, Oct 28, 2012 at 10:40 PM, DAN HELM <da...@verizon.net> wrote:
> Hi Diego,
> A number of us had the same issue when first working with the new CVB
> algorithm. The vector keys for CVB need to be Integers. You can use the
> rowid utility to convert the output from seq2sparse to the form needed by
> CVB, e.g.,
> http://comments.gmane.org/gmane.comp.apache.mahout.user/13112
> Dan



-- 
Computers are useless. They can only give you answers.
(Pablo Picasso)
_______________
Diego Ceccarelli
High Performance Computing Laboratory
Information Science and Technologies Institute (ISTI)
Italian National Research Council (CNR)
Via Moruzzi, 1
56124 - Pisa - Italy

Phone: +39 050 315 3055
Fax: +39 050 315 2040
________________________________________

Re: Using LDA in Mahout 0.0.7

Posted by DAN HELM <da...@verizon.net>.
Hi Diego, 
A number of us had the same issue when first working with the new CVB algorithm.  The vector keys for CVB need to be Integers.  You can use the rowid utility to convert the output from seq2sparse to the form needed by CVB, e.g.,  
http://comments.gmane.org/gmane.comp.apache.mahout.user/13112 
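From memory, the extra step looks roughly like this (flag names may be slightly off, so double-check with ./bin/mahout rowid --help and ./bin/mahout cvb --help; the output paths are just examples):

```shell
# Re-key the Text-keyed vectors with sequential IntWritable row ids
# (writes 'matrix' and 'docIndex' under the output dir)
./bin/mahout rowid -i /tmp/vector/tf-vectors -o /tmp/vector/rowid

# Run CVB on the re-keyed matrix
./bin/mahout cvb -i /tmp/vector/rowid/matrix -o /tmp/lda/topics \
    -dict /tmp/vector/dictionary.file-0 -k 100 -dt /tmp/lda/doc-topics
```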
Dan  
