Posted to user@mahout.apache.org by Arni Sumarlidason <Ar...@mdaus.com> on 2012/11/03 23:35:32 UTC

Mahout: CVB: Error

Good evening, and thank you for reading. I am trying to run CVB on Mahout 0.8.

I have successfully executed the following steps:
./mahout seqdirectory --input /user/root/lda --output text_seq -c UTF-8 -ow -chunk 8
Resulting in 20 chunk files.

./mahout seq2sparse -i text_seq -o text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
Resulting in 109MB vector, "part-r-00000", "dictionary.file-0", and more.

./mahout rowid -i text_vec/tf-vectors -o sparse-vectors-cvb
Resulting in "docIndex" & "matrix"

Now, when I attempt to run the following command:
./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex -o text_lda -k 100 -x 20 -dict text_vec/dictionary.file-0 -dt text_cvb_document -mt text_states
I get this error: No part files found in model path 'text_states/model-1'

Can someone please point me in the right direction?

Best regards,

Arni

Re: Mahout: CVB: Error

Posted by Jake Mannix <ja...@gmail.com>.

On Tue, Nov 6, 2012 at 6:39 AM, Arni Sumarlidason <
Arni.Sumarlidason@mdaus.com> wrote:

> Dan,
>
> Thank you for your time, patience, and detailed response.
>
> Another question: I don’t understand the results I’m receiving :(
>
> I’ve run this command: ./mahout cvb -i
> /user/root/sparse-vectors-cvb/matrix -o text_lda_sr -k 100 -x 1 -dict
> text_vec/dictionary.file-0 -dt text_cvb_document_sr -mt text_states_sr
> Followed by: ./mahout vectordump -i /user/root/text_cvb_document_sr -d
> text_vec/dictionary.file-0 -dt sequencefile -o lda-cvb-topics.txt
>
> I get a text file with term frequencies, but it has one line per document I
> originally created vectors from, not one line for each of the 100 topics. Am
> I doing something wrong?
>

./mahout vectordump

wants to take in vector files: you can give it the text inputs you started
with (text_cvb_document_sr, in your case), and you'll just see the
"bag-of-words" representation of your input docs.  If you give it one of
the "model" files (in text_lda_sr), then you'll get the term distributions
for the topics.
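
To make the difference concrete, here is a minimal Python sketch (not Mahout code; the toy dictionary, the weights, and the `top_terms` helper are all made up for illustration). It treats the topic model as a matrix with one row per topic and one column per dictionary term, so listing it yields k lines rather than one line per document.

```python
from typing import Dict, List


def top_terms(topic_row: List[float], dictionary: Dict[int, str], n: int = 3) -> List[str]:
    """Return the n highest-weighted terms of one topic's term distribution."""
    ranked = sorted(range(len(topic_row)), key=lambda i: topic_row[i], reverse=True)
    return [dictionary[i] for i in ranked[:n]]


if __name__ == "__main__":
    # Hypothetical four-term dictionary (term index -> term).
    dictionary = {0: "hadoop", 1: "mapper", 2: "goal", 3: "match"}
    # Hypothetical topic model with k = 2: one row per topic, one column per term.
    topic_term = [
        [0.50, 0.40, 0.05, 0.05],
        [0.05, 0.05, 0.50, 0.40],
    ]
    for topic_id, row in enumerate(topic_term):
        print(topic_id, top_terms(row, dictionary, n=2))
```

With -k 100 the model would have 100 such rows, which is the per-topic listing asked about above.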




>
> Thank you for your help,
>
>
> From: DAN HELM [mailto:danielhelm@verizon.net]
> Sent: Sunday, November 04, 2012 6:43 PM
> To: Arni Sumarlidason
> Cc: user@mahout.apache.org
> Subject: Re: Mahout: CVB: Error
>
> Arni,
>
> I had not formally contributed that code but it was posted before via
> email.
>
> Here is an initial approach developed where rowid will output one "part"
> file for each input "part" file processed:
>
>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201208.mbox/%3CCAOeeJfiuPMv=vs8Rm4Co0mjR-BeWgecayV3mHBB+yeBQ_o9M+g@mail.gmail.com%3E
>
> And this code will enable one to split the data up more via an optional "m"
> parameter that specifies the maximum number of vectors to write to a
> part file:
>
> http://permalink.gmane.org/gmane.comp.apache.mahout.devel/21821
>
> These were just some quickly developed utilities written some months ago
> when working with CVB.  Obviously there are other ways to split the data
> up.  You could also write software to post-process rowid's matrix output
> file and split it up so more mappers run.
>
> Lately I have been doing more with the Mahout k-means algorithm since I
> wanted to be able to cluster lots of documents in a timely manner.
>
> As specified in the thread you posted below, the run-time of LDA/CVB is
> very sensitive to the size of the dictionary processed.  This also
> affects mapper heap space requirements, since each mapper needs to store
> (dictionary size * k * 8 * 2) bytes in memory.  We also ran out of mapper
> heap space before when "dictionary size" and/or "k" increased a lot, so we
> had to reconfigure Hadoop for more mapper heap space (changed to 1 GB; no
> big deal to do).
>
> So yes, depending on how much data you are clustering and the dictionary
> size, it could take a long time to run.
>
> Dan
>
> From: Arni Sumarlidason <Arni.Sumarlidason@mdaus.com>
> To: DAN HELM <da...@verizon.net>
> Cc: user@mahout.apache.org
> Sent: Sunday, November 4, 2012 5:44 PM
> Subject: Re: Mahout: CVB: Error
>
> Dan,
>
> Regarding this thread,
> http://comments.gmane.org/gmane.comp.apache.mahout.user/13641
>
> Did you publish your modification to the rowid function enabling the
> splitting of matrix files? A single pass on my data takes 9 hours. Does
> this sound reasonable to you? Please advise.
>
> Best,
>
> Arni
>
> On Nov 3, 2012, at 8:38 PM, DAN HELM <danielhelm@verizon.net> wrote:
>
>
> Arni,
>
> I believe you are running with the wrong input for the cvb command:
> ./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex .....
>
> It should be: ./mahout cvb -i /user/root/sparse-vectors-cvb/matrix .....
>
> docIndex is a file generated by rowid that provides a mapping between the
> original sparse vector keys (in Text format) and the Integer keys assigned
> by rowid.
>
> Dan

-- 

  -jake

RE: Mahout: CVB: Error

Posted by Arni Sumarlidason <Ar...@mdaus.com>.

Dan,

Thank you for your time, patience, and detailed response.

Another question: I don’t understand the results I’m receiving :(

I’ve run this command: ./mahout cvb -i /user/root/sparse-vectors-cvb/matrix -o text_lda_sr -k 100 -x 1 -dict text_vec/dictionary.file-0 -dt text_cvb_document_sr -mt text_states_sr
Followed by: ./mahout vectordump -i /user/root/text_cvb_document_sr -d text_vec/dictionary.file-0 -dt sequencefile -o lda-cvb-topics.txt

I get a text file with term frequencies, but it has one line per document I originally created vectors from, not one line for each of the 100 topics. Am I doing something wrong?

Thank you for your help,


Re: Mahout: CVB: Error

Posted by DAN HELM <da...@verizon.net>.

Arni,

I had not formally contributed that code but it was posted before via email.

Here is an initial approach developed where rowid will output one "part" file for each input "part" file processed:

http://mail-archives.apache.org/mod_mbox/mahout-user/201208.mbox/%3CCAOeeJfiuPMv=vs8Rm4Co0mjR-BeWgecayV3mHBB+yeBQ_o9M+g@mail.gmail.com%3E

And this code will enable one to split the data up more via an optional "m" parameter that specifies the maximum number of vectors to write to a part file:

http://permalink.gmane.org/gmane.comp.apache.mahout.devel/21821

These were just some quickly developed utilities written some months ago when working with CVB.  Obviously there are other ways to split the data up.  You could also write software to post-process rowid's matrix output file and split it up so more mappers run.
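
As a rough illustration of the "at most m vectors per part file" idea, here is a minimal Python sketch. It is not the utility linked above: it works on an in-memory list rather than Hadoop SequenceFiles, and the `split_rows` name and the sample rows are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

Row = TypeVar("Row")


def split_rows(rows: Iterable[Row], max_per_part: int) -> Iterator[List[Row]]:
    """Yield consecutive chunks holding at most max_per_part rows each."""
    if max_per_part < 1:
        raise ValueError("max_per_part must be at least 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, max_per_part))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Stand-in for (row id, vector) pairs read from the rowid matrix output.
    rows = [(i, {0: float(i)}) for i in range(10)]
    for part_number, chunk in enumerate(split_rows(rows, 4)):
        # The names below only mimic Hadoop's part-file naming convention.
        print(f"part-{part_number:05d}: {len(chunk)} rows")
```

A real version would write each chunk out as its own part file so that more mappers can run, as described above.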

Lately I have been doing more with the Mahout k-means algorithm since I wanted to be able to cluster lots of documents in a timely manner.

As specified in the thread you posted below, the run-time of LDA/CVB is very sensitive to the size of the dictionary processed.  This also affects mapper heap space requirements, since each mapper needs to store (dictionary size * k * 8 * 2) bytes in memory.  We also ran out of mapper heap space before when "dictionary size" and/or "k" increased a lot, so we had to reconfigure Hadoop for more mapper heap space (changed to 1 GB; no big deal to do).
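
The rule of thumb above is easy to turn into a quick calculation. The sketch below assumes the 8 stands for bytes per double and the 2 for two copies of the model held by each mapper; both readings are assumptions, and the function name and dictionary sizes are made up for illustration.

```python
def cvb_model_bytes(dictionary_size: int, num_topics: int,
                    bytes_per_value: int = 8, copies: int = 2) -> int:
    """Estimate per-mapper model memory as dictionary size * k * 8 * 2 bytes."""
    return dictionary_size * num_topics * bytes_per_value * copies


if __name__ == "__main__":
    # Hypothetical dictionary sizes; k = 100 matches the -k value used in this thread.
    for vocab in (50_000, 500_000, 2_000_000):
        size = cvb_model_bytes(vocab, 100)
        print(f"{vocab:>9} terms x 100 topics -> {size / 2**20:,.0f} MiB")
```

For instance, a hypothetical 500,000-term dictionary with k = 100 comes to 800,000,000 bytes (about 763 MiB) for the model alone, which is in line with needing roughly 1 GB of mapper heap.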

So yes, depending on how much data you are clustering and the dictionary size, it could take a long time to run.

Dan

Re: Mahout: CVB: Error

Posted by Arni Sumarlidason <Ar...@mdaus.com>.

Dan,

Regarding this thread,
http://comments.gmane.org/gmane.comp.apache.mahout.user/13641

Did you publish your modification to the rowid function enabling the splitting of matrix files? A single pass on my data takes 9 hours. Does this sound reasonable to you? Please advise.

Best,

Arni

Re: Mahout: CVB: Error

Posted by DAN HELM <da...@verizon.net>.

Arni,
 
I believe you are running with the wrong input for the cvb command: ./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex .....
 
It should be: ./mahout cvb -i /user/root/sparse-vectors-cvb/matrix .....
 
docIndex is a file generated by rowid that provides a mapping between the original sparse vector keys (in Text format) and the Integer keys assigned by rowid.
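
As a rough picture of the two rowid outputs, here is a minimal Python sketch. It is not Mahout's code: it uses plain dictionaries instead of SequenceFiles, assumes row ids are handed out consecutively starting at 0, and the `assign_row_ids` name and sample keys are hypothetical.

```python
from typing import Dict, List, Tuple

SparseVector = Dict[int, float]  # term index -> weight


def assign_row_ids(
    vectors: List[Tuple[str, SparseVector]]
) -> Tuple[Dict[int, str], Dict[int, SparseVector]]:
    """Re-key (text key, vector) pairs by consecutive integers, rowid-style."""
    doc_index: Dict[int, str] = {}        # plays the role of "docIndex"
    matrix: Dict[int, SparseVector] = {}  # plays the role of "matrix"
    for row_id, (text_key, vector) in enumerate(vectors):
        doc_index[row_id] = text_key  # key mapping only, no vector data
        matrix[row_id] = vector       # the vectors themselves
    return doc_index, matrix


if __name__ == "__main__":
    # Hypothetical tf-vectors keyed by document name.
    tf_vectors = [("/doc-a.txt", {0: 2.0, 5: 1.0}), ("/doc-b.txt", {3: 4.0})]
    doc_index, matrix = assign_row_ids(tf_vectors)
    print("docIndex:", doc_index)
    print("matrix:  ", matrix)
```

Only the second structure carries the term vectors, which is why it is the one to pass to cvb with -i.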
 
Dan
  
