
Posted to user@mahout.apache.org by Jérémie Gomez <je...@gmail.com> on 2012/11/14 19:22:50 UTC

Command line : Error using clusterdump after cvb (0.7)

Hi everyone,

I have tried several of the clustering algorithms in Mahout and they worked
great, but I have a problem with the cvb implementation of Latent Dirichlet
Allocation. The cvb command works fine, but running clusterdump afterwards
gives me the following error:

Exception in thread "main" java.lang.ClassCastException:
org.apache.mahout.math.VectorWritable cannot be cast to
org.apache.mahout.clustering.iterator.ClusterWritable

What I do, in detail:
1) mahout seqdirectory -c UTF-8 -i inputdir -o sequencefiles
2) mahout seq2sparse -i sequencefiles -o sparsevectors -ow -a
org.apache.lucene.analysis.WhitespaceAnalyzer -x 99 -wt tfidf -s 5 -md 1 -x
90 -ng 2 -ml 50 -seq -n 2
3) mahout rowid -i sparsevectors/tf-vectors -o rowidresult
4) mahout cvb -i rowidresult/matrix -dict
sparsevectors/dictionary.file-0 -o topics -dt documents -mt states -ow -k 10
5) mahout clusterdump -i topics -o clusters -of TEXT -n 10 -d
marcelproust/dictionary.file-0 -dt sequencefile

When I run command 5, I get the error above. Unfortunately, I could not
find a working solution after searching the archives, so I thought I'd ask
the community!

Thanks a lot in advance.
Jeremie

Re: Command line : Error using clusterdump after cvb (0.7)

Posted by Jérémie Gomez <je...@gmail.com>.

Hi Jake,

It's a great idea indeed. However, I'm new to Mahout; could you give me
some pointers on where to publish this guide, and maybe point me to a
well-formed existing guide that I could use as a model?

Thank you!
Jeremie

Re: Command line : Error using clusterdump after cvb (0.7)

Posted by Jake Mannix <ja...@gmail.com>.

I'm glad to hear it's working better now! We should take the results of
getting this working and turn them into a step-by-step guide for new users;
I'm sure others would find it useful!


-- 

  -jake

Re: Command line : Error using clusterdump after cvb (0.7)

Posted by Jérémie Gomez <je...@gmail.com>.

Hello Jake,

Thank you very much for these helpful pointers: the problem is fixed!

The problem was indeed that the -sort argument (used with vectordump after
cvb) is broken in 0.7. I built from trunk, and cvb works well. As you
suggested, I have run cvb with 20 and 30 iterations, and the results are
quite interesting.
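
For anyone repeating this, here is a rough sketch of what the two longer runs
could look like. It only prints the commands and runs nothing. The flags and
input paths are copied from the cvb command quoted elsewhere in this thread,
with -x raised to 20 and 30; the per-run output names (cvboutput-20,
topicdistrib-20, temp-20, ...) and the cvb_command helper are made up for
illustration and are not what was actually typed.

```shell
#!/bin/sh
# Sketch only: prints one cvb command per iteration budget; nothing is run.
# Flags and input paths are copied from the cvb command quoted in this thread.
# Assumption: the per-run output names below are invented for illustration.
cvb_command() {
    iters=$1
    echo "bin/mahout cvb -i inputdir/matrix -o cvboutput-$iters" \
         "-k 20 -x $iters -dict seq2sparseoutput/dictionary.file-0" \
         "-dt topicdistrib-$iters -mt temp-$iters"
}

# Print the command for each iteration budget mentioned above.
for iters in 20 30; do
    cvb_command "$iters"
done
```

Keeping a separate output and temp directory per run avoids one run
overwriting the other, so the topics from 20 and 30 iterations can be
compared side by side.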

Thanks again for your suggestions; they helped a lot!
Jeremie

Re: Command line : Error using clusterdump after cvb (0.7)

Posted by Jake Mannix <ja...@gmail.com>.

On Thu, Nov 15, 2012 at 3:20 AM, Jérémie Gomez <je...@gmail.com>wrote:

> Thanks a lot Jake,
>
> I have tried using the vectordump job to retrieve the topics in text
> format, and obtained a text document stating all the terms in the
> dictionary file and numerical values, which I could not successfully
> interpret. My commands were the following:
>
> 1. bin/mahout cvb -i inputdir/matrix -o cvboutput -k 20 -x 10 -dict
> seq2sparseoutput/dictionary.file-0 -dt topicdistrib -mt temp/model-1
>
> 2. bin/mahout vectordump -i cvboutput -o termtopics -d
> seq2sparseoutput/dictionary.file-0 --dictionaryType sequencefile
> --vectorSize 5
>
>
> I'm guessing this might be due to the lack of the "-sort" option,


Yeah, you won't be able to interpret *at all* without sort - you'll just get
the first few terms for the topic, in no order at all (i.e. maybe ones
which are not likely in that topic at all, but have probability > 0).

Another thing: you're using temp/model-1 - sounds like you're looking
at your *first* iteration of the output?  That's nowhere near convergence,
and your topics will look like garbage - you need to take at least iteration
10 or 20 to see some good topics.
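
To make the "take a later iteration" point concrete, here is a small sketch
that picks the highest-numbered model directory. It assumes the model temp
directory has been copied to the local filesystem and contains one
subdirectory per iteration named model-1, model-2, and so on; the
latest_model function is a hypothetical helper, not part of Mahout. It
compares the numbers numerically, since a plain alphabetical sort would put
model-10 before model-2.

```shell
#!/bin/sh
# Hypothetical helper (not a Mahout command): print the highest-numbered
# model-N directory under a local model temp directory.
# Assumed layout: <dir>/model-1, <dir>/model-2, ..., one per iteration.
latest_model() {
    dir=$1
    best=""
    best_n=0
    for path in "$dir"/model-*; do
        [ -d "$path" ] || continue        # skip when nothing matched
        n=${path##*model-}                # text after the last "model-"
        case $n in
            ''|*[!0-9]*) continue ;;      # ignore non-numeric suffixes
        esac
        if [ "$n" -gt "$best_n" ]; then   # numeric, not alphabetical
            best_n=$n
            best=$path
        fi
    done
    # Print the winner; return non-zero when no model directory was found.
    [ -n "$best" ] && echo "$best"
}
```

Calling latest_model on the temp directory prints the path to dump; if it
prints a model-1 path, the run stopped after one iteration.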

> but I can't
> use the -sort option because of a heap memory problem that I can't fix by
> changing the MAHOUT_HEAPSIZE variable, and I get that heap memory problem
> even though I am running the cvb test on a 1.3 MB dataset...
>

So are you running on trunk?  I think -sort was broken in the last release,
but has been fixed for a few months now on subversion trunk.


>
> Thank you !
>
>



-- 

  -jake

Re: Command line : Error using clusterdump after cvb (0.7)

Posted by Jérémie Gomez <je...@gmail.com>.

Thanks a lot Jake,

I have tried using the vectordump job to retrieve the topics in text
format, and obtained a text document listing all the terms in the
dictionary file with numerical values, which I could not interpret.
My commands were the following:

1. bin/mahout cvb -i inputdir/matrix -o cvboutput -k 20 -x 10 -dict
seq2sparseoutput/dictionary.file-0 -dt topicdistrib -mt temp/model-1

2. bin/mahout vectordump -i cvboutput -o termtopics -d
seq2sparseoutput/dictionary.file-0 --dictionaryType sequencefile
--vectorSize 5


I'm guessing this might be due to the lack of the "-sort" option, but I can't
use -sort because of a heap memory problem that I can't fix by
changing the MAHOUT_HEAPSIZE variable, and I get that heap memory problem
even though I am running the cvb test on a 1.3 MB dataset...

Thank you!


Re: Command line : Error using clusterdump after cvb (0.7)

Posted by Jake Mannix <ja...@gmail.com>.

Clusterdump doesn't work on LDA output, as LDA doesn't produce "cluster"
objects.
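
A quick way to check what a given output file holds, before picking a dump
tool, is to look at the SequenceFile header, which stores the key and value
class names as plain text near the start of the file. The sketch below works
on a local copy of a part file. The function names seqfile_holds and
pick_dump_tool are made up for illustration, and it assumes the class names
appear within the first 512 bytes.

```shell
#!/bin/sh
# Hypothetical helpers (not Mahout commands).
# Assumption: the SequenceFile header, which names the key and value
# classes, fits within the first 512 bytes of the file.
seqfile_holds() {
    file=$1
    class=$2
    # -a treats binary input as text, -F matches the class name literally.
    head -c 512 "$file" | LC_ALL=C grep -a -q -F "$class"
}

# Suggest a dump tool based on the value class named in the header.
pick_dump_tool() {
    if seqfile_holds "$1" "org.apache.mahout.clustering.iterator.ClusterWritable"; then
        echo clusterdump
    elif seqfile_holds "$1" "org.apache.mahout.math.VectorWritable"; then
        echo vectordump
    else
        echo unknown
    fi
}
```

For the cvb topics output in this thread, the ClassCastException already
tells us the value class is VectorWritable, so the helper would be expected
to suggest vectordump.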

If you want to look at the topics for CVB, use vectordump:


mahout vectordump -s <path to topics sequence file> \
  --dictionary <path to dictionary.file-0> --dictionaryType seqfile \
  --vectorSize <num entries per topic you want to see> -sort
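
With the paths from the original question filled in, the call might look like
the sketch below. By default it only prints the command; set DRY_RUN=0 to
execute it. The topics and sparsevectors/dictionary.file-0 paths come from
the cvb step in the question. The flag spellings (-i, -o, -d,
--dictionaryType sequencefile) follow the vectordump invocation reported in
the follow-up message, with -sort added bare as written above. These
spellings may differ between Mahout versions, so check the tool's help output
for your build. The run and dump_topics helpers, the DRY_RUN variable, the
termtopics output name and the value 10 are illustrative choices.

```shell
#!/bin/sh
# Sketch only: prints the command unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Assumption: flag spellings are taken from commands quoted in this thread
# and are not verified against any particular Mahout release.
dump_topics() {
    topics=$1    # cvb -o output (term-topic vectors)
    dict=$2      # dictionary.file-0 written by seq2sparse
    top_n=$3     # number of entries to show per topic
    run mahout vectordump -i "$topics" -o termtopics \
        -d "$dict" --dictionaryType sequencefile \
        --vectorSize "$top_n" -sort
}

dump_topics topics sparsevectors/dictionary.file-0 10
```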



-- 

  -jake