Posted to user@mahout.apache.org by Christopher Schindler <id...@hotmail.com> on 2013/08/07 08:34:41 UTC

Using CVB; LdaTopics confusion

Hi all,
A noob question I'm sure but I'm stuck. I'm using CVB to cluster a text index of articles. 
Here's the CVB call:
bin/mahout cvb \
  -i /opt/mahout/lucene-sparse-vectors-cvb/matrix \
  -dict /opt/mahout/cvb-output/dict.file-* \
  -o /opt/mahout/cvb-output/topic_terms.out \
  -dt /opt/mahout/cvb-output/topic_dist.out \
  -k 200 \
  -mt /opt/mahout/output/iterations/ \
  -x 20 -a .25 -ow
I'm trying to access the topics using ldatopics per https://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation. 
My latest combination was: bin/mahout ldatopics -i opt/mahout/cvb-output/ -d /opt/mahout/cvb-output/dict.file-*
However, it returns an error stating: ERROR driver.MahoutDriver: : Try the new Collapsed Variation Bayes LDA, try bin/mahout cvb or bin/mahout cvb0_local
The spec is:
bin/mahout ldatopics \
    -i <input vectors directory> \
    -d <input dictionary file> \
What is the vectors directory supposed to be? Many thanks in advance.
Cheers!
Chris

Re: Using CVB; LdaTopics confusion

Posted by Liz Merkhofer <lm...@bericotechnologies.com>.
Christopher -

I had the same confusion with vectordump output on a Hadoop cluster. The
explanation is that vectordump does not try to write a file to your HDFS: the
-o output goes to the local filesystem. So when I just named a file (it did
not want to create a local directory), it wound up in the /bin I was working
out of.
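
As a rough sketch of that behaviour, assuming vectordump's -o is a path on the local filesystem as described above: create the local output directory yourself, then point -o at a file inside it. The output directory and file name, the part-m-00000 input name, and the run helper are all hypothetical; the flags are only the ones already used in this thread. The helper prints the command and executes it only when DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only. Paths follow the thread; names marked "assumed" are illustrative.
INPUT=/opt/mahout/cvb-output-topic/part-m-00000        # assumed part-file name on HDFS
LOCAL_OUT_DIR="${LOCAL_OUT_DIR:-${TMPDIR:-/tmp}/vectordump-out}"  # assumed local directory
LOCAL_OUT="$LOCAL_OUT_DIR/vectors.txt"                 # assumed local output file

# vectordump reportedly will not create a local directory, so create it first.
mkdir -p "$LOCAL_OUT_DIR"

# Hypothetical helper: print the command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

# -o names a local file, not an HDFS path.
run bin/mahout vectordump -i "$INPUT" -o "$LOCAL_OUT" -p true -dt sequencefile
```

Running it with DRY_RUN=0 from the Mahout install directory would execute the command; the dump should then appear under the local directory rather than in HDFS.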

Best,
Liz



Re: Using CVB; LdaTopics confusion

Posted by Suneel Marthi <su...@yahoo.com>.
It looks like you are specifying a directory as the input to vectordump.
It should be a file, something like /opt/mahout/cvb-output-topic/part-xxxx in your case.

Give that a try.
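
One way to find the actual part file name is to list the HDFS directory and keep only the part-* entries. The sketch below assumes the usual `hadoop fs -ls` layout, where the path is the last field on each line; part_files is a hypothetical helper, not part of Mahout or Hadoop.

```shell
#!/bin/sh
# Hypothetical helper: read `hadoop fs -ls` output on stdin and print
# only the paths of part files (the last field on each listing line).
part_files() {
  awk '{print $NF}' | grep '/part-'
}

# Usage (needs a Hadoop client on the PATH):
#   hadoop fs -ls /opt/mahout/cvb-output-topic | part_files
```

Any path it prints can then be passed to vectordump as the -i argument in place of the directory.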





RE: Using CVB; LdaTopics confusion

Posted by Christopher Schindler <id...@hotmail.com>.
Thank you Suneel, I appreciate the pointer. I am using Mahout 0.8 but I was following the wiki and not the examples/*.
I've gotten CVB to run successfully but now vectordump is giving me trouble. The call:
bin/mahout vectordump -i /opt/mahout/cvb-output-topic -o /opt/mahout/output -p true -c /opt/mahout/output/vectors.csv -dt sequencefile
The error returned is either:
Exception in thread "main" java.io.FileNotFoundException: /opt/mahout/output/ (No such file or directory) [this variant is triggered if I specify -c]
OR
Exception in thread "main" java.io.FileNotFoundException: /opt/mahout/output (Permission denied) [no -c param specified]
This is odd for several reasons. First, that's an HDFS directory, and the utilities have been writing to and creating directories in that location just fine through the prior steps. Second, the output directory does exist in HDFS. I've tried various combinations (referencing a directory that does/doesn't exist, appending an actual file name to the path, and others) with no success.
Any insight?
Cheers!
Chris


Re: Using CVB; LdaTopics confusion

Posted by Suneel Marthi <su...@yahoo.com>.
If you are using Mahout 0.8, I suggest looking at the CVB invocation in examples/bin/cluster-reuters.sh as a reference for the sequence of steps (and the other command-line options for each step).

ldatopics has been deprecated (in 0.8) and removed completely (in 0.9).

In any case, the input vectors directory in your case would be '/opt/mahout/cvb-output/topic_dist.out', but I would refrain from using ldatopics, as it has been deprecated.




