Posted to user@mahout.apache.org by DAN HELM <da...@verizon.net> on 2012/08/01 00:59:48 UTC

Re: Using Mahout to train a CVB and retrieve its topics

Folcon,
 
seqdirectory should also read files in subfolders.
 
Did you verify that the recent seqdirectory command did in fact generate non-empty sequence files?  I believe the seqdirectory command just assumes each file contains a single document (no concatenated documents per file), and that each file contains plain text.
 
If it did generate sequence files this time, I assume your folder "/user/sgeadmin/text_seq" was copied to HDFS (if not already there) before you ran seq2sparse on it?
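A quick way to sanity-check this is to run the pipeline on a single throwaway file. This is only a sketch: the /tmp/seqtest and /user/sgeadmin/seqtest paths are illustrative, and it assumes the same 0.7 CLI used elsewhere in this thread.

```shell
# Hypothetical sanity check: one folder, one plain-text document.
mkdir -p /tmp/seqtest
echo "the quick brown fox" > /tmp/seqtest/doc1

# Copy it to HDFS and run seqdirectory on it (paths are illustrative).
hadoop fs -put /tmp/seqtest /user/sgeadmin/seqtest
$MAHOUT_HOME/bin/mahout seqdirectory \
    --input /user/sgeadmin/seqtest \
    --output /user/sgeadmin/seqtest_seq -c UTF-8 -ow

# seqdumper should then show "Key: /doc1" with a non-empty Value and a
# Count greater than zero; if not, the problem is in the input files.
$MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/seqtest_seq
```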
 
Dan
 

________________________________
 From: Folcon Red <fo...@gmail.com>
To: DAN HELM <da...@verizon.net> 
Cc: Jake Mannix <ja...@gmail.com>; "user@mahout.apache.org" <us...@mahout.apache.org> 
Sent: Tuesday, July 31, 2012 1:34 PM
Subject: Re: Using Mahout to train a CVB and retrieve its topics
  
So part-r-00000 inside text_vec is
still SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable
even after moving all the training files into a single folder.

Regards,
Folcon

On 31 July 2012 18:18, Folcon Red <fo...@gmail.com> wrote:

> Hey Everyone,
>
> Ok, not certain why $MAHOUT_HOME/bin/mahout seqdirectory --input
> /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
> didn't produce sequence files; looking inside text_seq only gives me:
>
> SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
>
> and that's it. Any ideas what I've been doing wrong? Maybe it's because I
> have the files nested in the folder by class; for example, a tree view of
> the directory would look like:
>
> text_train -+
>                 | A -+
>                        | 100
>                        | 101
>                        | 103
>                 | B -+
>                        | 102
>                        | 105
>                        | 106
>
> So it's not picking them up? Or perhaps something else? I'm going to try
> some variations to see what happens.
>
> Thanks for the help so far!
>
> Regards,
> Folcon
>
>
> On 29 July 2012 22:10, Folcon Red <fo...@gmail.com> wrote:
>
>> Right, well here's something promising, running $MAHOUT_HOME/bin/mahout
>> seqdumper -i /user/sgeadmin/text_lda/part-m-00000 produced:
>>
>>
>> 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN,29533:NaN,29534:NaN,29535:NaN}
>>
>> And $MAHOUT_HOME/bin/mahout seqdumper -i
>> /user/sgeadmin/text_vec/tf-vectors/part-r-00000 produced:
>>
>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
>> HADOOP_CONF_DIR=/usr/lib/hadoop/conf
>> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments:
>> {--endPhase=[2147483647],
>> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
>> --startPhase=[0], --tempDir=[temp]}
>> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> Key class: class org.apache.hadoop.io.Text Value Class: class
>> org.apache.mahout.math.VectorWritable
>> Count: 0
>>
>> Kind Regards,
>> Folcon
>>
>> On 29 July 2012 21:29, DAN HELM <da...@verizon.net> wrote:
>>
>>> Yep, something went wrong, most likely with the clustering.  The part
>>> file is empty.  It should look something like this:
>>>
>>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
>>> org.apache.mahout.math.VectorWritable
>>> Key: 0: Value:
>>> {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
>>> Key: 1: Value:
>>> {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
>>> Key: 2: Value:
>>> {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
>>> ...
>>> ...
>>>
>>> The Key refers to a document id and the Value contains the topic
>>> id:weight pairs assigned to that document.
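If you only want the dominant topic per document out of that seqdumper output, the Value string can be picked apart with standard shell tools. A sketch, where the echoed line stands in for one Value taken from a dump like the above:

```shell
# Take one seqdumper Value of the form {topicId:weight,...} and print
# the id of the highest-weighted topic (topic 0 in this sample).
echo '{0:0.0647,1:0.0107,2:0.0054,3:0.0234}' \
  | tr -d '{}' \
  | tr ',' '\n' \
  | sort -t: -k2 -gr \
  | head -n1 \
  | cut -d: -f1
```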
>>>
>>> So you need to figure out where things went wrong.  I assume folder
>>> /user/sgeadmin/text_lda also has empty part files?  Assuming part
>>> files are there, run seqdumper on one.  It should have data like the
>>> above, except in this case the key will be a topic id and the vector
>>> will be term id:weight pairs.
>>>
>>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to make
>>> sure sparse vectors were generated for your input to cvb.
>>>
>>> Dan
>>>
>>>    *From:* Folcon Red <fo...@gmail.com>
>>> *To:* DAN HELM <da...@verizon.net>
>>> *Cc:* Jake Mannix <ja...@gmail.com>; "user@mahout.apache.org" <
>>> user@mahout.apache.org>
>>> *Sent:* Sunday, July 29, 2012 3:35 PM
>>>
>>> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
>>>
>>> Thanks Dan and Jake,
>>>
>>> The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i /user/
>>> sgeadmin/text_cvb_document/part-m-00000 is:
>>>
>>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
>>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
>>> org.apache.mahout.math.VectorWritable
>>> Count: 0
>>>
>>> I'm not certain what went wrong.
>>>
>>> Kind Regards,
>>> Folcon
>>>
>>> On 29 July 2012 18:49, DAN HELM <da...@verizon.net> wrote:
>>>
>>> Folcon,
>>>
>>> I'm still using Mahout 0.6 so don't know much about changes in 0.7.
>>>
>>> Your output folder for "dt" looks correct.  The relevant data would be
>>> in /user/sgeadmin/text_cvb_document/part-m-00000, which is what I would
>>> be passing to the "-s" option.  But I see it says the size is only 97,
>>> so that looks suspicious.  You can just view the file (for starters)
>>> as: mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000.
>>> The vectordump command (as Jake pointed out) has a lot more options to
>>> post-process the data, but you may want to first just see what is in
>>> that file.
>>>
>>> Dan
>>>
>>>    *From:* Folcon Red <fo...@gmail.com>
>>> *To:* Jake Mannix <ja...@gmail.com>
>>> *Cc:* user@mahout.apache.org; DAN HELM <da...@verizon.net>
>>> *Sent:* Sunday, July 29, 2012 1:08 PM
>>> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
>>>
>>> Hi Guys,
>>>
>>> Thanks for replying. The problem is that whenever I use the -s flag I
>>> get the error "Unexpected -s while processing Job-Specific Options:"
>>>
>>> Also I'm not sure if this is supposed to be the output of -dt
>>>
>>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop
>>> starcluster
>>> Found 3 items
>>> -rw-r--r--   3 sgeadmin supergroup          0 2012-07-29 16:51 /user/
>>> sgeadmin/text_cvb_document/_SUCCESS
>>> drwxr-xr-x   - sgeadmin supergroup          0 2012-07-29 16:50 /user/
>>> sgeadmin/text_cvb_document/_logs
>>> -rw-r--r--   3 sgeadmin supergroup         97 2012-07-29 16:51 /user/
>>> sgeadmin/text_cvb_document/part-m-00000
>>>
>>> Should I be using a newer version of Mahout? I've just been using the
>>> 0.7 distribution so far, as apparently the compiled versions are
>>> missing parts that the distributed ones have.
>>>
>>> Kind Regards,
>>> Folcon
>>>
>>> PS: Thanks for the help so far!
>>>
>>> On 29 July 2012 04:52, Jake Mannix <ja...@gmail.com> wrote:
>>>
>>>
>>>
>>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <da...@verizon.net>wrote:
>>>
>>> Hi Folcon,
>>>
>>> In the folder you specified for the -dt option of the cvb command,
>>> there should be sequence files with the document-to-topic associations
>>> (Key: IntWritable, Value: VectorWritable).
>>>
>>>
>>> Yeah, this is correct, although this:
>>>
>>>
>>> You can dump in text format as: mahout seqdumper -s <sequence file>
>>>
>>>
>>> is not as good as using vectordumper:
>>>
>>>    mahout vectordump -s <sequence file> \
>>>        --dictionary <path to dictionary.file-0> --dictionaryType seqfile \
>>>        --vectorSize <num entries per topic you want to see> -sort
>>>
>>> This joins your topic vectors with the dictionary, then picks out the
>>> top k terms (with their
>>> probabilities) for each topic and prints them to the console (or to the
>>> file you specify with
>>> an --output option).
>>>
>>> *although* I notice now that in trunk, when I just checked,
>>> VectorDumper.java had a bug in it for "vectorSize": line 175 asks for
>>> the cmdline option "numIndexesPerVector", not "vectorSize", ack!  So I
>>> took the liberty of fixing that, but you'll need to "svn up" and
>>> rebuild your jar before using vectordump like this.
>>>
>>>
>>> So in the text output from seqdumper, the key is a document id and the
>>> vector contains the topics and their associated scores for that
>>> document.  I think all topics are listed for each document, but many
>>> with near-zero score.
>>> In my case I used rowid to convert the keys of the original sparse
>>> document vectors from Text to Integer before running cvb; this also
>>> generates a mapping file, so I know the textual keys that correspond
>>> to the numeric document ids (since my original document ids were file
>>> names and I created named vectors).
>>> Hope this helps.
>>> Dan
>>>
>>> ________________________________
>>>
>>>  From: Folcon <fo...@gmail.com>
>>> To: user@mahout.apache.org
>>> Sent: Saturday, July 28, 2012 8:28 PM
>>> Subject: Using Mahout to train a CVB and retrieve its topics
>>>
>>> Hi Everyone,
>>>
>>> I'm posting this as my original message did not seem to appear on the
>>> mailing list; I'm very sorry if I have done this in error.
>>>
>>> I'm doing this to then use the topics to train a maxent algorithm to
>>> predict the
>>> classes of documents given their topic mixtures. Any further aid in this
>>> direction would be appreciated!
>>>
>>> I've been trying to extract the topics out of my run of cvb. Here's
>>> what I did
>>> so far.
>>>
>>> Ok, so I still don't know how to output the topics, but I have worked
>>> out how to run the cvb and get what I think are the document vectors;
>>> however, I'm not having any luck dumping them, so help here would
>>> still be appreciated!
>>>
>>> I set the values of:
>>>     export MAHOUT_HOME=/home/sgeadmin/mahout
>>>     export HADOOP_HOME=/usr/lib/hadoop
>>>     export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>>>     export HADOOP_CONF_DIR=$HADOOP_HOME/conf
>>> on the master, as otherwise none of this works.
>>>
>>> So first I uploaded the documents using starclusters put:
>>>     starcluster put mycluster text_train /home/sgeadmin/
>>>     starcluster put mycluster text_test /home/sgeadmin/
>>>
>>> Then I added them to Hadoop's HDFS filesystem:
>>>     dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop
>>> starcluster
>>>
>>> Then I called Mahout's seqdirectory to turn the text into sequence files:
>>>     $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train \
>>>         --output /user/sgeadmin/text_seq -c UTF-8 -ow
>>>
>>> Then I called Mahout's seq2sparse to turn them into vectors:
>>>     $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec \
>>>         -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>>
>>> Finally I called cvb. I believe the -dt flag states where the inferred
>>> topics should go, but because I haven't yet been able to dump them I
>>> can't confirm this.
>>>     $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors \
>>>         -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 \
>>>         -dict /user/sgeadmin/text_vec/dictionary.file-0 \
>>>         -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states
>>>
>>> The -k flag is the number of topics, the -nt flag is the size of the
>>> dictionary (I computed this by counting the number of entries in
>>> dictionary.file-0 inside the vectors folder, in this case under
>>> /user/sgeadmin/text_vec), and -x is the number of iterations.
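Rather than counting dictionary entries by hand, seqdumper can report the number, since it prints a trailing "Count:" line. A sketch, assuming seqdumper's -o output option; the /tmp output path is illustrative:

```shell
# Dump the dictionary to a local text file; the last line reports the
# number of entries, which is the value to pass to cvb's -nt flag.
$MAHOUT_HOME/bin/mahout seqdumper \
    -i /user/sgeadmin/text_vec/dictionary.file-0 \
    -o /tmp/dictionary.txt
grep '^Count:' /tmp/dictionary.txt
```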
>>>
>>> If you know how to get the document topic probabilities from here,
>>> help would be most appreciated!
>>>
>>> Kind Regards,
>>> Folcon
>>>
>>>
>>>
>>>
>>> --
>>>
>>>   -jake

Re: Using Mahout to train a CVB and retrieve its topics

Posted by DAN HELM <da...@verizon.net>.
Folcon,
 
I believe that is the way to kick up the heap space.  4096M is quite large, so I'm sure that should be fine.  I believe you have to restart Hadoop after making the config changes for them to take effect.
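For the restart, something along these lines should do it on a Hadoop of that era. A sketch, assuming the stop-all.sh/start-all.sh scripts that ship with a 1.x-style Hadoop install:

```shell
# Restart the whole cluster so the mapred-site.xml change takes effect.
$HADOOP_HOME/bin/stop-all.sh
$HADOOP_HOME/bin/start-all.sh

# Or, if present, restart only the MapReduce daemons:
# $HADOOP_HOME/bin/stop-mapred.sh && $HADOOP_HOME/bin/start-mapred.sh
```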
 
As far as determining which documents the numeric ids correspond to: if you ran rowid to convert the text ids to numeric, a mapping file called docIndex was also created that specifies the mapping between the numeric ids and the original text ids.
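The docIndex file is itself a sequence file, so it can be inspected the same way as the others in this thread. A sketch; the rowid output path is illustrative:

```shell
# rowid writes a Matrix file plus a docIndex mapping file. Dumping
# docIndex shows which integer id corresponds to which original
# (textual, e.g. file-name) document id.
$MAHOUT_HOME/bin/mahout seqdumper -i /user/root/text_vec/rowid/docIndex
```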
 
Dan
 

________________________________
 From: Folcon Red <fo...@gmail.com>
To: DAN HELM <da...@verizon.net> 
Cc: "user@mahout.apache.org" <us...@mahout.apache.org> 
Sent: Wednesday, August 8, 2012 10:30 AM
Subject: Re: Using Mahout to train a CVB and retrieve its topics
  
Hi Dan,

Thanks for that, it really helped =D...

Two questions. First, I keep getting Java heap errors when running some of
the MapReduce jobs. I've increased the Java heap by adding:

<property>
   <name>mapred.child.java.opts</name>
   <value>
     -Xmx4096M
   </value>
</property>

to $HADOOP_HOME/conf/mapred-site.xml, but it doesn't seem to have gotten
rid of them, or the job sometimes exits with a nonzero status error.

The other question is how do I now run inference with this text_lda? All
the labels etc. are now numeric. All the documents in my corpus belong to
one of several labels, so my original intent was to generate the topic
model, run inference on each document to work out where it belongs, and
then feed those values into an SGD logistic regression algorithm.

Kind Regards,
Folcon

On 5 August 2012 23:59, DAN HELM <da...@verizon.net> wrote:

> Hi Folcon,
>
> I had that same error some time ago when I first started working with CVB.
>
>
> CVB requires that the keys of the sparse vectors be Integer, not Text.
> You can convert textual keys from the seq2sparse output using the rowid
> command, e.g.,
>
> http://comments.gmane.org/gmane.comp.apache.mahout.user/13112
>
> That post talks about a different issue, but the sample code I posted is
> what I used.  The "mv" command was used to move out a file created by
> rowid that specifies the mapping between the original text ids (most
> likely file names) and the new integer ids created by rowid.
>
> Instead of moving out the mapping file, I could probably just have run
> cvb like this:
>
> $MAHOUT cvb \
>     -i ${WORK_DIR}/sparse-vectors-cvb/Matrix \
>     -o ${WORK_DIR}/reuters-cvb -k 150 -ow -x 10
>
> Dan
>
>    *From:* Folcon Red <fo...@gmail.com>
> *To:* DAN HELM <da...@verizon.net>
> *Cc:* "user@mahout.apache.org" <us...@mahout.apache.org>
> *Sent:* Sunday, August 5, 2012 5:29 PM
>
> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
>
> Hi Dan,
>
> I've managed to get the text_seq and text_vec generated properly, however
> when I run:
>
> $MAHOUT_HOME/bin/mahout cvb -i /user/root/text_vec/tf-vectors \
>     -o /user/root/text_lda -k 100 -nt 29536 -x 20 \
>     -dict /user/root/text_vec/dictionary.file-0 \
>     -dt /user/root/text_cvb_document -mt /user/root/text_states
>
> I get:
>
> 12/08/05 21:18:04 INFO mapred.JobClient: Task Id :
> attempt_201208051752_0002_m_000003_1, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.hadoop.io.IntWritable
> at
>
> org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:416)
> at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
> at org.apache.hadoop.mapred.Child.main(Child.java:264)
>
> Task attempt_201208051752_0002_m_000003_1 failed to report status for 600
> seconds. Killing!
>
> Any ideas what's causing this?
>
> Thank you for all the help so far!
>
> Kind Regards,
> Folcon
>
> On 2 August 2012 02:41, Folcon Red <fo...@gmail.com> wrote:
>
> > Thanks Dan,
> >
> > Ok, now for some strange reason it appears to be working (seq and vec
> > appear to have values now; I will test the complete cvb later, I
> > should head to bed...). The only things I think I changed were that I
> > stopped using absolute paths (referring to text_seq as opposed to
> > /user/root/text_seq) and I'm using root now instead of sgeadmin.
> >
> > Regards,
> > Folcon
> >
> >
> > On 1 August 2012 03:00, DAN HELM <da...@verizon.net> wrote:
> >
> >> Hi Folcon,
> >>
> >> There is no reason to rerun seq2sparse, as it is clear something is
> >> wrong with the text files being processed by the seqdirectory command.
> >>
> >> Based on the keys, I'm assuming the full paths of the input files
> >> are names like /high/59734, etc.  Did you look inside the files to
> >> make sure there is text in them?
> >>
> >> As a test, just create a folder with a simple text file and run that
> >> through seqdirectory, and I'll bet you will then see output from the
> >> seqdumper command (on the seqdirectory output).
> >>
> >> Thanks, Dan
> >>
>
> >>    *From:* Folcon Red <fo...@gmail.com>
> >> *To:* DAN HELM <da...@verizon.net>
> >> *Cc:* "user@mahout.apache.org" <us...@mahout.apache.org>
> >> *Sent:* Tuesday, July 31, 2012 7:28 PM
>
> >>
> >> *Subject:* Re: Using Mahout to train a CVB and retrieve its topics
> >>
> >> Hi Dan,
> >>
> >> It's good to know that seqdirectory reads files in subfolders. I've
> >> dumped out some of the values in the hope that they will be
> >> enlightening; the values seem to be missing for both text_seq and
> >> the tokenized-documents.
> >>
> >> So rerunning some of the commands:
> >> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train
> >> --output /user/sgeadmin/text_seq -c UTF-8 -ow
> >> $MAHOUT_HOME/bin/mahout seq2sparse -i /user/sgeadmin/text_seq -o
> >> /user/sgeadmin/text_vec -wt tf -a
> >> org.apache.lucene.analysis.WhitespaceAnalyzer -ow
> >>
> >> And then doing a seqdumper of text_seq:
> >> SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
> >> [...]
> >> Key: /high/59734: Value:
> >> Key: /high/264596: Value:
> >> Key: /high/341699: Value:
> >> Key: /high/260770: Value:
> >> Key: /high/222320: Value:
> >> Key: /high/198156: Value:
> >> Key: /high/326011: Value:
> >> Key: /high/112050: Value:
> >> Key: /high/306887: Value:
> >> Key: /high/208169: Value:
> >> Key: /high/283464: Value:
> >> Key: /high/168905: Value:
> >> Count: 2548
> >>
> >> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout seqdumper -i
> >> /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> >> HADOOP_CONF_DIR=/conf
> >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> >> 12/07/31 23:23:34 INFO common.AbstractJob: Command line arguments:
> >> {--endPhase=[2147483647],
> >> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
> >> --startPhase=[0], --tempDir=[temp]}
> >> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> >> Key class: class org.apache.hadoop.io.Text Value Class: class
> >> org.apache.mahout.math.VectorWritable
> >> Count: 0
> >>
> >> $MAHOUT_HOME/bin/mahout seqdumper -i
> >> /user/sgeadmin/text_vec/tokenized-documents/part-m-00000
> >> [...]
> >> Key: /high/396063: Value: []
> >> Key: /high/230246: Value: []
> >> Key: /high/136284: Value: []
> >> Key: /high/59734: Value: []
> >> Key: /high/264596: Value: []
> >> Key: /high/341699: Value: []
> >> Key: /high/260770: Value: []
> >> Key: /high/222320: Value: []
> >> Key: /high/198156: Value: []
> >> Key: /high/326011: Value: []
> >> Key: /high/112050: Value: []
> >> Key: /high/306887: Value: []
> >> Key: /high/208169: Value: []
> >> Key: /high/283464: Value: []
> >> Key: /high/168905: Value: []
> >> Count: 2548
> >>
> >>
> >> Running vectordump on the text_vec folder like so:
> >> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout vectordump -i
> >> /user/sgeadmin/text_vec
> >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> >> HADOOP_CONF_DIR=/conf
> >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> >> 12/07/31 23:21:08 INFO common.AbstractJob: Command line arguments:
> >> {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec],
> >> --startPhase=[0], --tempDir=[temp]}
> >> 12/07/31 23:21:08 INFO vectors.VectorDumper: Sort? false
> >> Exception in thread "main" java.lang.IllegalStateException:
> >> file:/user/sgeadmin/text_vec/tf-vectors
> >> at
> >>
> >>
> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
> >> at
> org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:194)
> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >> at
> >> org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> at
> >>
> >>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >> at
> >>
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> at java.lang.reflect.Method.invoke(Method.java:616)
> >> at
> >>
> >>
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> at
> >>
> >>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >> at
> >>
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> at java.lang.reflect.Method.invoke(Method.java:616)
> >> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> >> Caused by: java.io.FileNotFoundException:
> >> /user/sgeadmin/harry_old_mallet_vec/tf-vectors (Is a directory)
> >> at java.io.FileInputStream.open(Native Method)
> >> at java.io.FileInputStream.<init>(FileInputStream.java:137)
> >> at
> >> org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(
> >> RawLocalFileSystem.java:72)
> >> at
> >> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(
> >> RawLocalFileSystem.java:108)
> >> at
> >>
> org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:178)
> >> at
> >> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(
> >> ChecksumFileSystem.java:127)
> >> at
> >>
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
> >> at
> >>
> org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1452)
> >> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile
> >> .java:1431)
> >> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile
> >> .java:1424)
> >> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile
> >> .java:1419)
> >> at
> >> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<
> init
> >> >(SequenceFileIterator.java:58)
> >> at
> >>
> >>
> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
> >> ... 15 more
> >>
> >> Kind Regards,
> >> Nilu
> >>
> >> On 31 July 2012 23:59, DAN HELM <da...@verizon.net> wrote:
> >>
> >> > Folcon,
> >> >
> >> > seqdirectory should also read files in subfolders.
> >> >
> >> > Did you verify that recent seqdirectory command did in fact generate
> >> > non-empty sequence files?  I believe seqdirectory command just
> assumes
> >> > each file contains a single document (no concatenated documents per
> >> > file), and that each file contains basic text.
> >> >
> >> > If it did generate sequence files this time, I am assume your folder
> >> > "/user/sgeadmin/text_seq" was copied to hdfs (if not already there)
> >> before
> >> > you ran seq2sparse on it?
> >> >
> >> > Dan
> >> >
> >> >    *From:* Folcon Red <fo...@gmail.com>
> >> > *To:* DAN HELM <da...@verizon.net>
> >> > *Cc:* Jake Mannix <ja...@gmail.com>; "user@mahout.apache.org" <
> >> > user@mahout.apache.org>
> >> > *Sent:* Tuesday, July 31, 2012 1:34 PM
> >>
> >> >
> >> > *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
> >> >
> >> > So part-r-00000 inside text_vec is
> >> > still
> SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable
> >> > even after moving all the training files into a single folder.
> >> >
> >> > Regards,
> >> > Folcon
> >> >
> >> > On 31 July 2012 18:18, Folcon Red <fo...@gmail.com> wrote:
> >> >
> >> > > Hey Everyone,
> >> > >
> >> > > Ok not certain why  $MAHOUT_HOME/bin/mahout seqdirectory --input
> >> /user/
> >> > > sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
> >> didn't
> >> >
> >> > > produce sequence files, just looking inside text_seq only gives me:
> >> > >
> >> > > SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
> >> > >
> >> > > and that's it. Any ideas what I've been doing wrong? Maybe it's
> >> because I
> >> > > have the files nested in the folder by class, for example a tree
> view
> >> of
> >> > > the directory would look like.
> >> > >
> >> > > text_train -+
> >> > >                | A -+
> >> > >                        | 100
> >> > >                        | 101
> >> > >                        | 103
> >> > >                | B -+
> >> > >                        | 102
> >> > >                        | 105
> >> > >                        | 106
> >> > >
> >> > > So it's not picking them up? Or perhaps something else? I'm going to
> >> try
> >> > > some variations to see what happens.
> >> > >
> >> > > Thanks for the help so far!
> >> > >
> >> > > Regards,
> >> > > Folcon
> >> > >
> >> > >
> >> > > On 29 July 2012 22:10, Folcon Red <fo...@gmail.com> wrote:
> >> > >
> >> > >> Right, well here's something promising, running
> >> $MAHOUT_HOME/bin/mahout
> >> > >> seqdumper -i /user/sgeadmin/text_lda/part-m-00000 produced:
> >> > >>
> >> > >>
> >> > >>
> >> > 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN
> >> ,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN
> >> ,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN
> >> ,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN
> >> ,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN
> >> ,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN
> >> ,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN
> >> ,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN
> >> ,29533:NaN,29534:NaN,29535:NaN}
> >> > >>
> >> > >> And $MAHOUT_HOME/bin/mahout seqdumper -i
> >> > >> /user/sgeadmin/text_vec/tf-vectors/part-r-00000 produced:
> >> > >>
> >> > >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> >> > >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> >> > >> HADOOP_CONF_DIR=/usr/lib/hadoop/conf
> >> > >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> >> > >> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments:
> >> > >> {--endPhase=[2147483647],
> >> > >> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
> >> > >> --startPhase=[0], --tempDir=[temp]}
> >> > >> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> >> > >> Key class: class org.apache.hadoop.io.Text Value Class: class
> >> > >> org.apache.mahout.math.VectorWritable
> >> > >> Count: 0
> >> > >>
> >> > >> Kind Regards,
> >> > >> Folcon
> >> > >>
> >> > >> On 29 July 2012 21:29, DAN HELM <da...@verizon.net> wrote:
> >> > >>
> >> > >>> Yep something went wrong, most likely with the clustering.  part
> >> file
> >> > is
> >> > >>> empty.  Should look something like this:
> >> > >>>
> >> > >>> Key class: class org.apache.hadoop.io.IntWritable Value Class:
> class
> >> > >>> org.apache.mahout.math.VectorWritable
> >> > >>> Key: 0: Value:
> >> > >>>
> >> >
> >>
> {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
> >> > >>> Key: 1: Value:
> >> > >>>
> >> >
> >>
> {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
> >> > >>> Key: 2: Value:
> >> > >>>
> >> >
> >>
> {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
> >> > >>> ...
> >> > >>> ...
> >> > >>>
> >> > >>> Key refers to a document id and the Value are topic ids:weights
> >> > assigned
> >> > >>> to document id.
> >> > >>>
> >> > >>> So you need to figure out where things went wrong.  I'm assume
> >> folder
> >> > >>> /user/sgeadmin/text_lda also has empty part files?  Assuming
> parts
> >> > >>> files are there run seqdumper on one.  Should have data like the
> >> above
> >> > >>> except in this case the key will be a topic id and the vector will
> >> be
> >> > term
> >> > >>> ids:weights.
> >> > >>>
> >> > >>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to
> >> make
> >> > >>> sure sparse vectors were generated for your input to cvb.
> >> > >>>
> >> > >>> Dan
> >> > >>>
> >> > >>>    *From:* Folcon Red <fo...@gmail.com>
> >> > >>> *To:* DAN HELM <da...@verizon.net>
> >> > >>> *Cc:* Jake Mannix <ja...@gmail.com>; "
> user@mahout.apache.org"
> >> <
> >> > >>> user@mahout.apache.org>
> >> > >>> *Sent:* Sunday, July 29, 2012 3:35 PM
> >> > >>>
> >> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's
> >> topics
> >> >
> >> > >>>
> >> > >>> Thanks Dan and Jake,
> >> > >>>
> >> > >>> The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i /user/
> >> > >>> sgeadmin/text_cvb_document/part-m-00000 is:
> >> > >>>
> >> > >>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
> >> > >>> Key class: class org.apache.hadoop.io <
> >> > http://org.apache.hadoop.io.int/>
> >> >
> >> > >>> .IntWritable Value Class: class
> >> org.apache.mahout.math.VectorWritable
> >> > >>> Count: 0
> >> > >>>
> >> > >>> I'm not certain what went wrong.
> >> > >>>
> >> > >>> Kind Regards,
> >> > >>> Folcon
> >> > >>>
> >> > >>> On 29 July 2012 18:49, DAN HELM <da...@verizon.net> wrote:
> >> > >>>
> >> > >>> Folcon,
> >> > >>>
> >> > >>> I'm still using Mahout 0.6 so don't know much about changes in
> 0.7.
> >> > >>>
> >> > >>> Your output folder for "dt" looks correct.  The relevant data
> >> would be
> >> > >>> in  /user/sgeadmin/text_cvb_document/part-m-00000 which is what I
> >> would
> >> > >>> be passing to a "-s" option.  But I see it says size is only 97 so
> >> that
> >> > >>> looks suspicious.  So you can just view file (for starters) as:
> >> mahout
> >> > >>> seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000.  And
> >> the
> >> > >>> vector dumper command (as Jake pointed out) has a lot more options
> >> to
> >> > >>> post-process the data but you may want to first just see what is
> in
> >> > >>> that file.
> >> > >>>
> >> > >>> Dan
> >> > >>>
> >> > >>>    *From:* Folcon Red <fo...@gmail.com>
> >> > >>> *To:* Jake Mannix <ja...@gmail.com>
> >> > >>> *Cc:* user@mahout.apache.org; DAN HELM <da...@verizon.net>
> >> > >>> *Sent:* Sunday, July 29, 2012 1:08 PM
> >> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's
> >> topics
> >> >
> >> > >>>
> >> > >>> Hi Guys,
> >> > >>>
> >> > >>> Thanks for replying, the problem is whenever I use any -s flag I
> get
> >> > the
> >> > >>> error "Unexpected -s while processing Job-Specific Options:"
> >> > >>>
> >> > >>> Also I'm not sure if this is supposed to be the output of -dt
> >> > >>>
> >> > >>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -
> >> hadoop
> >> > >>> starcluster
> >> > >>> Found 3 items
> >> > >>> -rw-r--r--  3 sgeadmin supergroup          0 2012-07-29 16:51
> >> /user/
> >> > >>> sgeadmin/text_cvb_document/_SUCCESS
> >> > >>> drwxr-xr-x  - sgeadmin supergroup          0 2012-07-29 16:50
> >> /user/
> >> > >>> sgeadmin/text_cvb_document/_logs
> >> > >>> -rw-r--r--  3 sgeadmin supergroup        97 2012-07-29 16:51
> /user/
> >> > >>> sgeadmin/text_cvb_document/part-m-00000
> >> > >>>
> >> > >>> Should I be using a newer version of mahout? I've just been using
> >> the
> >> > >>> 0.7 distribution so far as apparently the compiled versions are
> >> missing
> >> > >>> parts that the distributed ones have.
> >> > >>>
> >> > >>> Kind Regards,
> >> > >>> Folcon
> >> > >>>
> >> > >>> PS: Thanks for the help so far!
> >> > >>>
> >> > >>> On 29 July 2012 04:52, Jake Mannix <ja...@gmail.com> wrote:
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <danielhelm@verizon.net
> >> > >wrote:
> >> > >>>
> >> > >>> Hi Folcon,
> >> > >>>
> >> > >>> In the folder you specified for the -dt option for cvb command
> >> > >>> there should be sequence files with the document to topic
> >> associations
> >> > >>> (Key:
> >> > >>> IntWritable, Value: VectorWritable).
> >> > >>>
> >> > >>>
> >> > >>> Yeah, this is correct, although this:
> >> > >>>
> >> > >>>
> >> > >>> You can dump in text format as: mahout seqdumper -s <sequence
> file>
> >> > >>>
> >> > >>>
> >> > >>> is not as good as using vectordumper:
> >> > >>>
> >> > >>>    mahout vectordump -s <sequence file> --dictionary <path to
> >> > dictionary.file-0>
> >> > >>> \
> >> > >>>        --dictionaryType seqfile --vectorSize <num entries per
> >> topic you
> >> > >>> want to see> -sort
> >> > >>>
> >> > >>> This joins your topic vectors with the dictionary, then picks out
> >> the
> >> > >>> top k terms (with their
> >> > >>> probabilities) for each topic and prints them to the console (or
> to
> >> the
> >> > >>> file you specify with
> >> > >>> an --output option).
> >> > >>>
> >> > >>> *although* I notice now that in trunk when I just checked,
> >> > VectorDumper.java
> >> > >>> had a bug
> >> > >>> in it for "vectorSize" - line 175 asks for cmdline option "
> >> > >>> numIndexesPerVector" not
> >> > >>> vectorSize, ack!  So I took the liberty of fixing that, but
> you'll
> >> need
> >> > >>> to "svn up" and rebuild
> >> > >>> your jar before using vectordump like this.
> >> > >>>
> >> > >>>
> >> > >>>  So in text output from seqdumper, the key is a document id and
> the
> >> > >>> vector contains
> >> > >>> the topics and associated scores associated with the document.  I
> >> think
> >> > >>> all topics are listed for each
> >> > >>> document but many with near zero score.
> >> > >>> In my case I used rowid to convert keys of original sparse
> >> > >>> document vectors from Text to Integer before running cvb and this
> >> > >>> generates a mapping file so I know the textual
> >> > >>> keys that correspond to the numeric document ids (since my
> original
> >> > >>> document ids were file names and I created named vectors).
> >> > >>> Hope this helps.
> >> > >>> Dan
> >> > >>>
> >> > >>> ________________________________
> >> > >>>
> >> > >>>  From: Folcon <fo...@gmail.com>
> >> > >>> To: user@mahout.apache.org
> >> > >>> Sent: Saturday, July 28, 2012 8:28 PM
> >> > >>> Subject: Using Mahout to train an CVB and retrieve it's topics
> >> > >>>
> >> > >>> Hi Everyone,
> >> > >>>
> >> > >>> I'm posting this as my original message did not seem to appear on
> >> the
> >> > >>> mailing
> >> > >>> list, I'm very sorry if I have done this in error.
> >> > >>>
> >> > >>> I'm doing this to then use the topics to train a maxent algorithm
> >> to
> >> > >>> predict the
> >> > >>> classes of documents given their topic mixtures. Any further aid
> in
> >> > this
> >> > >>> direction would be appreciated!
> >> > >>>
> >> > >>> I've been trying to extract the topics out of my run of cvb.
> Here's
> >> > >>> what I did
> >> > >>> so far.
> >> > >>>
> >> > >>> Ok, so I still don't know how to output the topics, but I have
> >> worked
> >> > >>> out how to
> >> > >>> get the cvb and what I think are the document vectors, however
> I'm
> >> not
> >> > >>> having
> >> > >>> any luck dumping them, so help here would still be appreciated!
> >> > >>>
> >> > >>> I set the values of:
> >> > >>>    export MAHOUT_HOME=/home/sgeadmin/mahout
> >> > >>>    export HADOOP_HOME=/usr/lib/hadoop
> >> > >>>    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
> >> > >>>    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
> >> > >>> on the master otherwise none of this works.
> >> > >>>
> >> > >>> So first I uploaded the documents using starclusters put:
> >> > >>>    starcluster put mycluster text_train /home/sgeadmin/
> >> > >>>    starcluster put mycluster text_test /home/sgeadmin/
> >> > >>>
> >> > >>> Then I added them to hadoop's hbase filesystem:
> >> > >>>    dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop
> >> > >>> starcluster
> >> > >>>
> >> > >>> Then I called Mahout's seqdirectory to turn the text into
> sequence
> >> > files
> >> > >>>    $MAHOUT_HOME/bin/mahout seqdirectory --input
> >> > /user/sgeadmin/text_train
> >> > >>> --
> >> > >>> output /user/sgeadmin/text_seq -c UTF-8 -ow
> >> > >>>
> >> > >>> Then I called Mahout's seq2parse to turn them into vectors
> >> > >>>    $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/
> sgeadmin
> >> > >>> /text_vec -
> >> > >>> wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
> >> > >>>
> >> > >>> Finally I called cvb, I believe that the -dt flag states where
> the
> >> > >>> inferred
> >> > >>> topics should go, but because I haven't yet been able to dump
> them I
> >> > >>> can't
> >> > >>> confirm this.
> >> > >>>    $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors
> >> > >>> -o
> >> > >>> /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict
> >> > >>> /user/sgeadmin/text_vec/dictionary.file-0 -dt
> >> > /user/sgeadmin/text_cvb_document
> >> > >>> -
> >> > >>> mt /user/sgeadmin/text_states
> >> > >>>
> >> > >>> The -k flag is the number of topics, the -nt flag is the size of
> >> the
> >> > >>> dictionary,
> >> > >>> I computed this by counting the number of entries of the
> >> > >>> dictionary.file-0
> >> > >>> inside the vectors(in this case under /user/sgeadmin/text_vec)
> and
> >> -x
> >> > >>> is the
> >> > >>> number of iterations.
> >> > >>>
> >> > >>> If you know how to get what the document topic probabilities are
> >> from
> >> > >>> here, help
> >> > >>> would be most appreciated!
> >> > >>>
> >> > >>> Kind Regards,
> >> > >>> Folcon
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> --
> >> > >>>
> >> > >>>  -jake
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>
> >> > >>
> >> > >
> >> >
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >>
> >
> >
>
>
>
>

Re: Using Mahout to train an CVB and retrieve it's topics

Posted by Folcon Red <fo...@gmail.com>.
Hi Dan,

Thanks for that, it really helped =D...

Two questions. Firstly, I keep getting Java heap errors when running some of
the MapReduce jobs. I've increased the Java heap by adding:

<property>
   <name>mapred.child.java.opts</name>
   <value>
     -Xmx4096M
   </value>
 </property>

to $HADOOP_HOME/conf/mapred-site.xml, but it doesn't seem to have gotten
rid of them; sometimes the job still fails with a nonzero exit status.
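Two other places the heap limit can bite besides mapred-site.xml are sketched below. This is a hedged example only: the MAHOUT_HEAPSIZE variable and the -D generic-option pass-through are assumptions about this Mahout 0.7 / Hadoop setup, not something confirmed in this thread, so verify them against your bin/mahout script before relying on them.

```shell
# Sketch only -- MAHOUT_HEAPSIZE (in MB) and -D pass-through are assumed
# to be supported by this Mahout/Hadoop version; verify before relying on.

# 1) Raise the heap of the local (client-side) JVM that bin/mahout starts:
export MAHOUT_HEAPSIZE=4096

# 2) Override the task heap per job, in case the edited mapred-site.xml is
#    not the one the submitting node actually reads:
$MAHOUT_HOME/bin/mahout seq2sparse \
    -Dmapred.child.java.opts=-Xmx4096M \
    -i /user/root/text_seq -o /user/root/text_vec -wt tf -ow
```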

The other question is how to run inference with this text_lda now that all
the labels etc. are numeric. All the documents in my corpus belong to one
of several labels, so my original intent was to generate the topic model,
run inference on each document to work out where it belongs, and then feed
those values into an SGD logistic regression algorithm.
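One hedged way to recover (label, topic mixture) pairs from the cvb output is sketched below. It assumes rowid was run before cvb (as Dan describes), so a docIndex mapping file exists; the exact docIndex path and seqdumper's -o flag are assumptions, and the paths are the ones used earlier in this thread.

```shell
# Hedged sketch: assumes rowid was run before cvb, leaving a docIndex file
# that maps integer row ids back to the original Text keys (file paths
# like /high/59734, which carry the class label). Paths are illustrative.

# Document -> topic-probability vectors written by cvb's -dt option:
$MAHOUT_HOME/bin/mahout seqdumper \
    -i /user/root/text_cvb_document/part-m-00000 -o /tmp/doc_topics.txt

# Integer row id -> original document key:
$MAHOUT_HOME/bin/mahout seqdumper \
    -i /user/root/text_vec/docIndex -o /tmp/doc_index.txt

# Joining the two dumps on the integer id yields (label, topic mixture)
# rows, which can then be fed to an SGD logistic regression trainer.
```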

Kind Regards,
Folcon

On 5 August 2012 23:59, DAN HELM <da...@verizon.net> wrote:

> Hi Folcon,
>
> I had that same error some time ago when I first started working with CVB.
>
>
> CVB requires that the key of sparse vectors be Integer not Text.  You can
> convert textual keys from the seq2sparse output using the rowid command,
> e.g.,
>
> http://comments.gmane.org/gmane.comp.apache.mahout.user/13112
>
> That post talks about a different issue but the sample code I posted is
> what I used.  The "mv" command was used to move a file out that was
> created by rowid, that specifies the mapping between the original text
> ids (most likely file names) to the new integers created by rowid.
>
> Instead of moving out the mapping file I could probably just have run cvb like this:
>
> $MAHOUT cvb \
>     -i ${WORK_DIR}/sparse-vectors-cvb/Matrix \
>     -o ${WORK_DIR}/reuters-cvb -k 150 -ow -x 10 \
> Dan
>
>    *From:* Folcon Red <fo...@gmail.com>
> *To:* DAN HELM <da...@verizon.net>
> *Cc:* "user@mahout.apache.org" <us...@mahout.apache.org>
> *Sent:* Sunday, August 5, 2012 5:29 PM
>
> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>
> Hi Dan,
>
> I've managed to get the text_seq and text_vec generated properly, however
> when I run:
>
> $MAHOUT_HOME/bin/mahout cvb -i /user/root/text_vec/tf-vectors -o
> /user/root/text_lda -k 100 -nt 29536 -x 20 -dict
> /user/root/text_vec/dictionary.file-0 -dt /user/root/text_cvb_document -
> mt /user/root/text_states
>
> I get:
>
> 12/08/05 21:18:04 INFO mapred.JobClient: Task Id :
> attempt_201208051752_0002_m_000003_1, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.hadoop.io.IntWritable
> at
>
> org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:416)
> at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
> at org.apache.hadoop.mapred.Child.main(Child.java:264)
>
> Task attempt_201208051752_0002_m_000003_1 failed to report status for 600
> seconds. Killing!
>
> Any ideas what's causing this?
>
> Thank you for all the help so far!
>
> Kind Regards,
> Folcon
>
> On 2 August 2012 02:41, Folcon Red <fo...@gmail.com> wrote:
>
> > Thanks Dan,
> >
> > Ok, now for some strange reason it appears to be working (seq and vec
> > have values now; I will test the complete cvb later, I should head to
> > bed...). The only things I think I changed were that I stopped using
> > absolute paths (referring to text_seq as opposed to /user/root/text_seq)
> > and that I'm using root now instead of sgeadmin.
> >
> > Regards,
> > Folcon
> >
> >
> > On 1 August 2012 03:00, DAN HELM <da...@verizon.net> wrote:
> >
> >> Hi Folcon,
> >>
> >> There is no reason to rerun seq2sparse as it is clear something is
> >> wrong with the text files being processed by
>
> >> seqdirectory command.
> >>
> >> Based on the keys, I'm assuming the full paths to the input files
> >> are names like /high/59734, etc.  Did you look inside the files to make
> >> sure there is text in them?
> >>
> >> As a test, just create a folder with a simple text file and run that
> >> through seqdirectory and I'll bet you will then see output from
> >> seqdumper command (from
> >> seqdirectory output).
> >>
> >> Thanks, Dan
> >>
>
> >>    *From:* Folcon Red <fo...@gmail.com>
> >> *To:* DAN HELM <da...@verizon.net>
> >> *Cc:* "user@mahout.apache.org" <us...@mahout.apache.org>
> >> *Sent:* Tuesday, July 31, 2012 7:28 PM
>
> >>
> >> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
> >>
> >> Hi Dan,
> >>
> >> It's good to know that seqdirectory reads files in subfolders and I've
> >> dumped out some of the values in the hopes that they will be
> >> enlightening, The values seem to be missing for both the text_seq and
> >> the tokenized-documents.
> >>
> >> So rerunning some of the commands:
> >> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train
> >> --output /user/sgeadmin/text_seq -c UTF-8 -ow
> >> $MAHOUT_HOME/bin/mahout seq2sparse -i /user/sgeadmin/text_seq -o
> >> /user/sgeadmin/text_vec -wt tf -a
> >> org.apache.lucene.analysis.WhitespaceAnalyzer -ow
> >>
> >> And then doing a seqdumper of text_seq:
> >> SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
> >> [...]
> >> Key: /high/59734: Value:
> >> Key: /high/264596: Value:
> >> Key: /high/341699: Value:
> >> Key: /high/260770: Value:
> >> Key: /high/222320: Value:
> >> Key: /high/198156: Value:
> >> Key: /high/326011: Value:
> >> Key: /high/112050: Value:
> >> Key: /high/306887: Value:
> >> Key: /high/208169: Value:
> >> Key: /high/283464: Value:
> >> Key: /high/168905: Value:
> >> Count: 2548
> >>
> >> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout seqdumper -i
> >> /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> >> HADOOP_CONF_DIR=/conf
> >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> >> 12/07/31 23:23:34 INFO common.AbstractJob: Command line arguments:
> >> {--endPhase=[2147483647],
> >> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
> >> --startPhase=[0], --tempDir=[temp]}
> >> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> >> Key class: class org.apache.hadoop.io.Text Value Class: class
> >> org.apache.mahout.math.VectorWritable
> >> Count: 0
> >>
> >> $MAHOUT_HOME/bin/mahout seqdumper -i
> >> /user/sgeadmin/text_vec/tokenized-documents/part-m-00000
> >> [...]
> >> Key: /high/396063: Value: []
> >> Key: /high/230246: Value: []
> >> Key: /high/136284: Value: []
> >> Key: /high/59734: Value: []
> >> Key: /high/264596: Value: []
> >> Key: /high/341699: Value: []
> >> Key: /high/260770: Value: []
> >> Key: /high/222320: Value: []
> >> Key: /high/198156: Value: []
> >> Key: /high/326011: Value: []
> >> Key: /high/112050: Value: []
> >> Key: /high/306887: Value: []
> >> Key: /high/208169: Value: []
> >> Key: /high/283464: Value: []
> >> Key: /high/168905: Value: []
> >> Count: 2548
> >>
> >>
> >> Running vectordump on the text_vec folder like so:
> >> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout vectordump-i
> >> /user/sgeadmin/text_vec
> >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> >> HADOOP_CONF_DIR=/conf
> >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> >> 12/07/31 23:21:08 INFO common.AbstractJob: Command line arguments:
> >> {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec],
> >> --startPhase=[0], --tempDir=[temp]}
> >> 12/07/31 23:21:08 INFO vectors.VectorDumper: Sort? false
> >> Exception in thread "main" java.lang.IllegalStateException:
> >> file:/user/sgeadmin/text_vec/tf-vectors
> >> at
> >>
> >>
> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
> >> at
> org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:194)
> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >> at
> >> org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> at
> >>
> >>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >> at
> >>
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> at java.lang.reflect.Method.invoke(Method.java:616)
> >> at
> >>
> >>
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> at
> >>
> >>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >> at
> >>
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> at java.lang.reflect.Method.invoke(Method.java:616)
> >> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> >> Caused by: java.io.FileNotFoundException:
> >> /user/sgeadmin/harry_old_mallet_vec/tf-vectors (Is a directory)
> >> at java.io.FileInputStream.open(Native Method)
> >> at java.io.FileInputStream.<init>(FileInputStream.java:137)
> >> at
> >> org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(
> >> RawLocalFileSystem.java:72)
> >> at
> >> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(
> >> RawLocalFileSystem.java:108)
> >> at
> >>
> org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:178)
> >> at
> >> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(
> >> ChecksumFileSystem.java:127)
> >> at
> >>
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
> >> at
> >>
> org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1452)
> >> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile
> >> .java:1431)
> >> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile
> >> .java:1424)
> >> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile
> >> .java:1419)
> >> at
> >> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<
> init
> >> >(SequenceFileIterator.java:58)
> >> at
> >>
> >>
> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
> >> ... 15 more
> >>
> >> Kind Regards,
> >> Nilu
> >>
> >> On 31 July 2012 23:59, DAN HELM <da...@verizon.net> wrote:
> >>
> >> > Folcon,
> >> >
> >> > seqdirectory should also read files in subfolders.
> >> >
> >> > Did you verify that recent seqdirectory command did in fact generate
> >> > non-empty sequence files?  I believe seqdirectory command just
> assumes
> >> > each file contains a single document (no concatenated documents per
> >> > file), and that each file contains basic text.
> >> >
> >> > If it did generate sequence files this time, I assume your folder
> >> > "/user/sgeadmin/text_seq" was copied to hdfs (if not already there)
> >> before
> >> > you ran seq2sparse on it?
> >> >
> >> > Dan
> >> >
> >> >    *From:* Folcon Red <fo...@gmail.com>
> >> > *To:* DAN HELM <da...@verizon.net>
> >> > *Cc:* Jake Mannix <ja...@gmail.com>; "user@mahout.apache.org" <
> >> > user@mahout.apache.org>
> >> > *Sent:* Tuesday, July 31, 2012 1:34 PM
> >>
> >> >
> >> > *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
> >> >
> >> > So part-r-00000 inside text_vec is
> >> > still
> SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable
> >> > even after moving all the training files into a single folder.
> >> >
> >> > Regards,
> >> > Folcon
> >> >
> >> > On 31 July 2012 18:18, Folcon Red <fo...@gmail.com> wrote:
> >> >
> >> > > Hey Everyone,
> >> > >
> >> > > Ok not certain why  $MAHOUT_HOME/bin/mahout seqdirectory --input
> >> /user/
> >> > > sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
> >> didn't
> >> >
> >> > > produce sequence files, just looking inside text_seq only gives me:
> >> > >
> >> > > SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
> >> > >
> >> > > and that's it. Any ideas what I've been doing wrong? Maybe it's
> >> because I
> >> > > have the files nested in the folder by class, for example a tree
> view
> >> of
> >> > > the directory would look like.
> >> > >
> >> > > text_train -+
> >> > >                | A -+
> >> > >                        | 100
> >> > >                        | 101
> >> > >                        | 103
> >> > >                | B -+
> >> > >                        | 102
> >> > >                        | 105
> >> > >                        | 106
> >> > >
> >> > > So it's not picking them up? Or perhaps something else? I'm going to
> >> try
> >> > > some variations to see what happens.
> >> > >
> >> > > Thanks for the help so far!
> >> > >
> >> > > Regards,
> >> > > Folcon
> >> > >
> >> > >
> >> > > On 29 July 2012 22:10, Folcon Red <fo...@gmail.com> wrote:
> >> > >
> >> > >> Right, well here's something promising, running
> >> $MAHOUT_HOME/bin/mahout
> >> > >> seqdumper -i /user/sgeadmin/text_lda/part-m-00000 produced:
> >> > >>
> >> > >>
> >> > >>
> >> > 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN
> >> ,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN
> >> ,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN
> >> ,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN
> >> ,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN
> >> ,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN
> >> ,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN
> >> ,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN
> >> ,29533:NaN,29534:NaN,29535:NaN}
> >> > >>
> >> > >> And $MAHOUT_HOME/bin/mahout seqdumper -i
> >> > >> /user/sgeadmin/text_vec/tf-vectors/part-r-00000 produced:
> >> > >>
> >> > >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> >> > >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> >> > >> HADOOP_CONF_DIR=/usr/lib/hadoop/conf
> >> > >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> >> > >> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments:
> >> > >> {--endPhase=[2147483647],
> >> > >> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
> >> > >> --startPhase=[0], --tempDir=[temp]}
> >> > >> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> >> > >> Key class: class org.apache.hadoop.io.Text Value Class: class
> >> > >> org.apache.mahout.math.VectorWritable
> >> > >> Count: 0
> >> > >>
> >> > >> Kind Regards,
> >> > >> Folcon
> >> > >>
> >> > >> On 29 July 2012 21:29, DAN HELM <da...@verizon.net> wrote:
> >> > >>
> >> > >>> Yep something went wrong, most likely with the clustering.  part
> >> file
> >> > is
> >> > >>> empty.  Should look something like this:
> >> > >>>
> >> > >>> Key class: class org.apache.hadoop.io.IntWritable Value Class:
> class
> >> > >>> org.apache.mahout.math.VectorWritable
> >> > >>> Key: 0: Value:
> >> > >>>
> >> >
> >>
> {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
> >> > >>> Key: 1: Value:
> >> > >>>
> >> >
> >>
> {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
> >> > >>> Key: 2: Value:
> >> > >>>
> >> >
> >>
> {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
> >> > >>> ...
> >> > >>> ...
> >> > >>>
> >> > >>> Key refers to a document id and the Value are topic ids:weights
> >> > assigned
> >> > >>> to document id.
> >> > >>>
> >> > >>> So you need to figure out where things went wrong.  I'm assume
> >> folder
> >> > >>> /user/sgeadmin/text_lda also has empty part files?  Assuming
> parts
> >> > >>> files are there run seqdumper on one.  Should have data like the
> >> above
> >> > >>> except in this case the key will be a topic id and the vector will
> >> be
> >> > term
> >> > >>> ids:weights.
> >> > >>>
> >> > >>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to
> >> make
> >> > >>> sure sparse vectors were generated for your input to cvb.
> >> > >>>
> >> > >>> Dan
> >> > >>>
> >> > >>>    *From:* Folcon Red <fo...@gmail.com>
> >> > >>> *To:* DAN HELM <da...@verizon.net>
> >> > >>> *Cc:* Jake Mannix <ja...@gmail.com>; "
> user@mahout.apache.org"
> >> <
> >> > >>> user@mahout.apache.org>
> >> > >>> *Sent:* Sunday, July 29, 2012 3:35 PM
> >> > >>>
> >> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's
> >> topics
> >> >
> >> > >>>
> >> > >>> Thanks Dan and Jake,
> >> > >>>
> >> > >>> The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i /user/
> >> > >>> sgeadmin/text_cvb_document/part-m-00000 is:
> >> > >>>
> >> > >>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
> >> > >>> Key class: class org.apache.hadoop.io <
> >> > http://org.apache.hadoop.io.int/>
> >> >
> >> > >>> .IntWritable Value Class: class
> >> org.apache.mahout.math.VectorWritable
> >> > >>> Count: 0
> >> > >>>
> >> > >>> I'm not certain what went wrong.
> >> > >>>
> >> > >>> Kind Regards,
> >> > >>> Folcon
> >> > >>>
> >> > >>> On 29 July 2012 18:49, DAN HELM <da...@verizon.net> wrote:
> >> > >>>
> >> > >>> Folcon,
> >> > >>>
> >> > >>> I'm still using Mahout 0.6 so don't know much about changes in
> 0.7.
> >> > >>>
> >> > >>> Your output folder for "dt" looks correct.  The relevant data
> >> would be
> >> > >>> in  /user/sgeadmin/text_cvb_document/part-m-00000 which is what I
> >> would
> >> > >>> be passing to a "-s" option.  But I see it says size is only 97 so
> >> that
> >> > >>> looks suspicious.  So you can just view file (for starters) as:
> >> mahout
> >> > >>> seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000.  And
> >> the
> >> > >>> vector dumper command (as Jake pointed out) has a lot more options
> >> to
> >> > >>> post-process the data but you may want to first just see what is
> in
> >> > >>> that file.
> >> > >>>
> >> > >>> Dan
> >> > >>>
> >> > >>>    *From:* Folcon Red <fo...@gmail.com>
> >> > >>> *To:* Jake Mannix <ja...@gmail.com>
> >> > >>> *Cc:* user@mahout.apache.org; DAN HELM <da...@verizon.net>
> >> > >>> *Sent:* Sunday, July 29, 2012 1:08 PM
> >> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's
> >> topics
> >> >
> >> > >>>
> >> > >>> Hi Guys,
> >> > >>>
> >> > >>> Thanks for replying, the problem is whenever I use any -s flag I
> get
> >> > the
> >> > >>> error "Unexpected -s while processing Job-Specific Options:"
> >> > >>>
> >> > >>> Also I'm not sure if this is supposed to be the output of -dt
> >> > >>>
> >> > >>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -
> >> hadoop
> >> > >>> starcluster
> >> > >>> Found 3 items
> >> > >>> -rw-r--r--  3 sgeadmin supergroup          0 2012-07-29 16:51
> >> /user/
> >> > >>> sgeadmin/text_cvb_document/_SUCCESS
> >> > >>> drwxr-xr-x  - sgeadmin supergroup          0 2012-07-29 16:50
> >> /user/
> >> > >>> sgeadmin/text_cvb_document/_logs
> >> > >>> -rw-r--r--  3 sgeadmin supergroup        97 2012-07-29 16:51
> /user/
> >> > >>> sgeadmin/text_cvb_document/part-m-00000
> >> > >>>
> >> > >>> Should I be using a newer version of mahout? I've just been using
> >> the
> >> > >>> 0.7 distribution so far as apparently the compiled versions are
> >> missing
> >> > >>> parts that the distributed ones have.
> >> > >>>
> >> > >>> Kind Regards,
> >> > >>> Folcon
> >> > >>>
> >> > >>> PS: Thanks for the help so far!
> >> > >>>
> >> > >>> On 29 July 2012 04:52, Jake Mannix <ja...@gmail.com> wrote:
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <danielhelm@verizon.net
> >> > >wrote:
> >> > >>>
> >> > >>> Hi Folcon,
> >> > >>>
> >> > >>> In the folder you specified for the -dt option for cvb command
> >> > >>> there should be sequence files with the document to topic
> >> associations
> >> > >>> (Key:
> >> > >>> IntWritable, Value: VectorWritable).
> >> > >>>
> >> > >>>
> >> > >>> Yeah, this is correct, although this:
> >> > >>>
> >> > >>>
> >> > >>> You can dump in text format as: mahout seqdumper -s <sequence
> file>
> >> > >>>
> >> > >>>
> >> > >>> is not as good as using vectordumper:
> >> > >>>
> >> > >>>    mahout vectordump -s <sequence file> --dictionary <path to
> >> > dictionary.file-0>
> >> > >>> \
> >> > >>>        --dictionaryType seqfile --vectorSize <num entries per
> >> topic you
> >> > >>> want to see> -sort
> >> > >>>
> >> > >>> This joins your topic vectors with the dictionary, then picks out
> >> the
> >> > >>> top k terms (with their
> >> > >>> probabilities) for each topic and prints them to the console (or
> to
> >> the
> >> > >>> file you specify with
> >> > >>> an --output option).
> >> > >>>
> >> > >>> *although* I notice now that in trunk when I just checked,
> >> > VectorDumper.java
> >> > >>> had a bug
> >> > >>> in it for "vectorSize" - line 175 asks for cmdline option "
> >> > >>> numIndexesPerVector" not
> >> > >>> vectorSize, ack!  So I took the liberty of fixing that, but
> you'll
> >> need
> >> > >>> to "svn up" and rebuild
> >> > >>> your jar before using vectordump like this.
> >> > >>>
> >> > >>>
> >> > >>>  So in text output from seqdumper, the key is a document id and
> the
> >> > >>> vector contains
> >> > >>> the topics and associated scores associated with the document.  I
> >> think
> >> > >>> all topics are listed for each
> >> > >>> document but many with near zero score.
> >> > >>> In my case I used rowid to convert keys of original sparse
> >> > >>> document vectors from Text to Integer before running cvb and this
> >> > >>> generates a mapping file so I know the textual
> >> > >>> keys that correspond to the numeric document ids (since my
> original
> >> > >>> document ids were file names and I created named vectors).
> >> > >>> Hope this helps.
> >> > >>> Dan
> >> > >>>
> >> > >>> ________________________________
> >> > >>>
> >> > >>>  From: Folcon <fo...@gmail.com>
> >> > >>> To: user@mahout.apache.org
> >> > >>> Sent: Saturday, July 28, 2012 8:28 PM
> >> > >>> Subject: Using Mahout to train an CVB and retrieve it's topics
> >> > >>>
> >> > >>> Hi Everyone,
> >> > >>>
> >> > >>> I'm posting this as my original message did not seem to appear on
> >> the
> >> > >>> mailing
> >> > >>> list, I'm very sorry if I have done this in error.
> >> > >>>
> >> > >>> I'm doing this to then use the topics to train a maxent algorithm
> >> to
> >> > >>> predict the
> >> > >>> classes of documents given their topic mixtures. Any further aid
> in
> >> > this
> >> > >>> direction would be appreciated!
> >> > >>>
> >> > >>> I've been trying to extract the topics out of my run of cvb.
> Here's
> >> > >>> what I did
> >> > >>> so far.
> >> > >>>
> >> > >>> Ok, so I still don't know how to output the topics, but I have
> >> worked
> >> > >>> out how to
> >> > >>> get the cvb and what I think are the document vectors, however
> I'm
> >> not
> >> > >>> having
> >> > >>> any luck dumping them, so help here would still be appreciated!
> >> > >>>
> >> > >>> I set the values of:
> >> > >>>    export MAHOUT_HOME=/home/sgeadmin/mahout
> >> > >>>    export HADOOP_HOME=/usr/lib/hadoop
> >> > >>>    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
> >> > >>>    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
> >> > >>> on the master otherwise none of this works.
> >> > >>>
> >> > >>> So first I uploaded the documents using starclusters put:
> >> > >>>    starcluster put mycluster text_train /home/sgeadmin/
> >> > >>>    starcluster put mycluster text_test /home/sgeadmin/
> >> > >>>
> >> > >>> Then I added them to hadoop's hbase filesystem:
> >> > >>>    dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop
> >> > >>> starcluster
> >> > >>>
> >> > >>> Then I called Mahout's seqdirectory to turn the text into
> sequence
> >> > files
> >> > >>>    $MAHOUT_HOME/bin/mahout seqdirectory --input
> >> > /user/sgeadmin/text_train
> >> > >>> --
> >> > >>> output /user/sgeadmin/text_seq -c UTF-8 -ow
> >> > >>>
> >> > >>> Then I called Mahout's seq2sparse to turn them into vectors
> >> > >>>    $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec
> >> > >>> -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
> >> > >>>
> >> > >>> Finally I called cvb, I believe that the -dt flag states where
> the
> >> > >>> inferred
> >> > >>> topics should go, but because I haven't yet been able to dump
> them I
> >> > >>> can't
> >> > >>> confirm this.
> >> > >>>    $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors -o
> >> > >>> /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict
> >> > >>> /user/sgeadmin/text_vec/dictionary.file-0 -dt /user/sgeadmin/text_cvb_document
> >> > >>> -mt /user/sgeadmin/text_states
> >> > >>>
> >> > >>> The -k flag is the number of topics, the -nt flag is the size of
> >> the
> >> > >>> dictionary,
> >> > >>> I computed this by counting the number of entries of the
> >> > >>> dictionary.file-0
> >> > >>> inside the vectors(in this case under /user/sgeadmin/text_vec)
> and
> >> -x
> >> > >>> is the
> >> > >>> number of iterations.
> >> > >>>
> >> > >>> If you know how to get what the document topic probabilities are
> >> from
> >> > >>> here, help
> >> > >>> would be most appreciated!
> >> > >>>
> >> > >>> Kind Regards,
> >> > >>> Folcon
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> --
> >> > >>>
> >> > >>>  -jake
> >> > >>>

Re: Using Mahout to train an CVB and retrieve it's topics

Posted by DAN HELM <da...@verizon.net>.
Hi Folcon,
 
I had that same error some time ago when I first started working with CVB.  
 
CVB requires that the keys of the sparse vectors be IntWritable, not Text.  You can convert the textual keys in the seq2sparse output using the rowid command, e.g.,
 
http://comments.gmane.org/gmane.comp.apache.mahout.user/13112
 
That post talks about a different issue, but the sample code I posted there is what I used.  The "mv" command was used to move out a file created by rowid that specifies the mapping between the original text ids (most likely file names) and the new integers created by rowid.
 
Instead of moving out the mapping file I could probably just have run cvb like this:
 
$MAHOUT cvb \
    -i ${WORK_DIR}/sparse-vectors-cvb/Matrix \
    -o ${WORK_DIR}/reuters-cvb -k 150 -ow -x 10 \

Dan
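As a quick, Mahout-independent sanity check on which key class a part file actually carries, the SequenceFile header can be inspected directly: it starts with the magic bytes 'SEQ', a version byte, and then the key and value class names, each written as a length-prefixed UTF-8 string. The sketch below is a hedged illustration, not official tooling; it assumes the common case where each class name is under 128 bytes, so the Hadoop vint length prefix is a single byte.

```python
def read_seqfile_classes(data: bytes):
    """Extract key/value class names from a SequenceFile header.

    Assumes the common layout: magic 'SEQ', one version byte, then each
    class name written as a single-byte vint length followed by UTF-8
    bytes (valid while the names are under 128 bytes).
    """
    if data[:3] != b"SEQ":
        raise ValueError("not a SequenceFile header")
    pos = 4  # skip magic (3 bytes) + version byte
    names = []
    for _ in range(2):  # key class name, then value class name
        length = data[pos]
        pos += 1
        names.append(data[pos:pos + length].decode("utf-8"))
        pos += length
    return tuple(names)  # (key_class, value_class)
```

Run against the first few hundred bytes of a part file, this would report e.g. org.apache.hadoop.io.Text as the key class for un-converted seq2sparse output; that Text key is exactly what triggers the cast error in CVB.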
 

________________________________
 From: Folcon Red <fo...@gmail.com>
To: DAN HELM <da...@verizon.net> 
Cc: "user@mahout.apache.org" <us...@mahout.apache.org> 
Sent: Sunday, August 5, 2012 5:29 PM
Subject: Re: Using Mahout to train an CVB and retrieve it's topics
  
Hi Dan,

I've managed to get the text_seq and text_vec generated properly, however
when I run:

$MAHOUT_HOME/bin/mahout cvb -i /user/root/text_vec/tf-vectors -o
/user/root/text_lda -k 100 -nt 29536 -x 20 -dict
/user/root/text_vec/dictionary.file-0 -dt /user/root/text_cvb_document
-mt /user/root/text_states

I get:

12/08/05 21:18:04 INFO mapred.JobClient: Task Id :
attempt_201208051752_0002_m_000003_1, Status : FAILED
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
org.apache.hadoop.io.IntWritable
at
org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)

Task attempt_201208051752_0002_m_000003_1 failed to report status for 600
seconds. Killing!

Any ideas what's causing this?

Thank you for all the help so far!

Kind Regards,
Folcon

On 2 August 2012 02:41, Folcon Red <fo...@gmail.com> wrote:

> Thanks Dan,
>
> Ok, now for some strange reason it appears to be working (seq and vec
> appear to have values now; I'll test the complete cvb run later, I should
> head to bed...). The only things I think I changed were that I stopped
> using absolute paths (referring to text_seq as opposed to
> /user/root/text_seq) and that I'm now using root instead of sgeadmin.
>
> Regards,
> Folcon
>
>
> On 1 August 2012 03:00, DAN HELM <da...@verizon.net> wrote:
>
>> Hi Folcon,
>>
>> There is no reason to rerun seq2sparse, as it is clear something is wrong
>> with the text files being processed by the seqdirectory command.
>>
>> Based on the keys, I'm assuming the full paths to the input files look
>> like /high/59734, etc.  Did you look inside the files to make
>> sure there is text in them?
>>
>> As a test, just create a folder with a simple text file and run that
>> through seqdirectory, and I'll bet you will then see output from the
>> seqdumper command (run on the seqdirectory output).
>>
>> Thanks, Dan
>>
>>    *From:* Folcon Red <fo...@gmail.com>
>> *To:* DAN HELM <da...@verizon.net>
>> *Cc:* "user@mahout.apache.org" <us...@mahout.apache.org>
>> *Sent:* Tuesday, July 31, 2012 7:28 PM
>>
>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>>
>> Hi Dan,
>>
>> It's good to know that seqdirectory reads files in subfolders and I've
>> dumped out some of the values in the hopes that they will be
>> enlightening, The values seem to be missing for both the text_seq and
>> the tokenized-documents.
>>
>> So rerunning some of the commands:
>> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train
>> --output /user/sgeadmin/text_seq -c UTF-8 -ow
>> $MAHOUT_HOME/bin/mahout seq2sparse -i /user/sgeadmin/text_seq -o
>> /user/sgeadmin/text_vec -wt tf -a
>> org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>
>> And then doing a seqdumper of text_seq:
>> SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
>> [...]
>> Key: /high/59734: Value:
>> Key: /high/264596: Value:
>> Key: /high/341699: Value:
>> Key: /high/260770: Value:
>> Key: /high/222320: Value:
>> Key: /high/198156: Value:
>> Key: /high/326011: Value:
>> Key: /high/112050: Value:
>> Key: /high/306887: Value:
>> Key: /high/208169: Value:
>> Key: /high/283464: Value:
>> Key: /high/168905: Value:
>> Count: 2548
>>
>> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout seqdumper -i
>> /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
>> HADOOP_CONF_DIR=/conf
>> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> 12/07/31 23:23:34 INFO common.AbstractJob: Command line arguments:
>> {--endPhase=[2147483647],
>> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
>> --startPhase=[0], --tempDir=[temp]}
>> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> Key class: class org.apache.hadoop.io.Text Value Class: class
>> org.apache.mahout.math.VectorWritable
>> Count: 0
>>
>> $MAHOUT_HOME/bin/mahout seqdumper -i
>> /user/sgeadmin/text_vec/tokenized-documents/part-m-00000
>> [...]
>> Key: /high/396063: Value: []
>> Key: /high/230246: Value: []
>> Key: /high/136284: Value: []
>> Key: /high/59734: Value: []
>> Key: /high/264596: Value: []
>> Key: /high/341699: Value: []
>> Key: /high/260770: Value: []
>> Key: /high/222320: Value: []
>> Key: /high/198156: Value: []
>> Key: /high/326011: Value: []
>> Key: /high/112050: Value: []
>> Key: /high/306887: Value: []
>> Key: /high/208169: Value: []
>> Key: /high/283464: Value: []
>> Key: /high/168905: Value: []
>> Count: 2548
>>
>>
>> Running vectordump on the text_vec folder like so:
>> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout vectordump -i
>> /user/sgeadmin/text_vec
>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
>> HADOOP_CONF_DIR=/conf
>> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> 12/07/31 23:21:08 INFO common.AbstractJob: Command line arguments:
>> {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec],
>> --startPhase=[0], --tempDir=[temp]}
>> 12/07/31 23:21:08 INFO vectors.VectorDumper: Sort? false
>> Exception in thread "main" java.lang.IllegalStateException:
>> file:/user/sgeadmin/text_vec/tf-vectors
>> at
>>
>> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
>> at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:194)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at
>> org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:616)
>> at
>>
>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:616)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>> Caused by: java.io.FileNotFoundException:
>> /user/sgeadmin/harry_old_mallet_vec/tf-vectors (Is a directory)
>> at java.io.FileInputStream.open(Native Method)
>> at java.io.FileInputStream.<init>(FileInputStream.java:137)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(
>> RawLocalFileSystem.java:72)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(
>> RawLocalFileSystem.java:108)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:178)
>> at
>> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(
>> ChecksumFileSystem.java:127)
>> at
>> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
>> at
>> org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1452)
>> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile
>> .java:1431)
>> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile
>> .java:1424)
>> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile
>> .java:1419)
>> at
>> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init
>> >(SequenceFileIterator.java:58)
>> at
>>
>> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
>> ... 15 more
>>
>> Kind Regards,
>> Nilu
>>
>> On 31 July 2012 23:59, DAN HELM <da...@verizon.net> wrote:
>>
>> > Folcon,
>> >
>> > seqdirectory should also read files in subfolders.
>> >
>> > Did you verify that recent seqdirectory command did in fact generate
>> > non-empty sequence files?  I believe seqdirectory command just assumes
>> > each file contains a single document (no concatenated documents per
>> > file), and that each file contains basic text.
>> >
>> > If it did generate sequence files this time, I assume your folder
>> > "/user/sgeadmin/text_seq" was copied to hdfs (if not already there)
>> before
>> > you ran seq2sparse on it?
>> >
>> > Dan
>> >
>> >    *From:* Folcon Red <fo...@gmail.com>
>> > *To:* DAN HELM <da...@verizon.net>
>> > *Cc:* Jake Mannix <ja...@gmail.com>; "user@mahout.apache.org" <
>> > user@mahout.apache.org>
>> > *Sent:* Tuesday, July 31, 2012 1:34 PM
>>
>> >
>> > *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>> >
>> > So part-r-00000 inside text_vec is
>> > still SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable
>> > even after moving all the training files into a single folder.
>> >
>> > Regards,
>> > Folcon
>> >
>> > On 31 July 2012 18:18, Folcon Red <fo...@gmail.com> wrote:
>> >
>> > > Hey Everyone,
>> > >
>> > > Ok not certain why  $MAHOUT_HOME/bin/mahout seqdirectory --input
>> > > /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
>> > > didn't produce sequence files, just looking inside text_seq only gives me:
>> > >
>> > > SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
>> > >
>> > > and that's it. Any ideas what I've been doing wrong? Maybe it's
>> because I
>> > > have the files nested in the folder by class, for example a tree view
>> of
>> > > the directory would look like.
>> > >
>> > > text_train -+
>> > >                | A -+
>> > >                        | 100
>> > >                        | 101
>> > >                        | 103
>> > >                | B -+
>> > >                        | 102
>> > >                        | 105
>> > >                        | 106
>> > >
>> > > So it's not picking them up? Or perhaps something else? I'm going to
>> try
>> > > some variations to see what happens.
>> > >
>> > > Thanks for the help so far!
>> > >
>> > > Regards,
>> > > Folcon
>> > >
>> > >
>> > > On 29 July 2012 22:10, Folcon Red <fo...@gmail.com> wrote:
>> > >
>> > >> Right, well here's something promising, running
>> $MAHOUT_HOME/bin/mahout
>> > >> seqdumper -i /user/sgeadmin/text_lda/part-m-00000 produced:
>> > >>
>> > >>
>> > >>
>> > 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN
>> ,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN
>> ,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN
>> ,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN
>> ,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN
>> ,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN
>> ,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN
>> ,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN
>> ,29533:NaN,29534:NaN,29535:NaN}
>> > >>
>> > >> And $MAHOUT_HOME/bin/mahout seqdumper -i
>> > >> /user/sgeadmin/text_vec/tf-vectors/part-r-00000 produced:
>> > >>
>> > >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> > >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
>> > >> HADOOP_CONF_DIR=/usr/lib/hadoop/conf
>> > >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> > >> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments:
>> > >> {--endPhase=[2147483647],
>> > >> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
>> > >> --startPhase=[0], --tempDir=[temp]}
>> > >> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> > >> Key class: class org.apache.hadoop.io.Text Value Class: class
>> > >> org.apache.mahout.math.VectorWritable
>> > >> Count: 0
>> > >>
>> > >> Kind Regards,
>> > >> Folcon
>> > >>
>> > >> On 29 July 2012 21:29, DAN HELM <da...@verizon.net> wrote:
>> > >>
>> > >>> Yep something went wrong, most likely with the clustering.  part
>> file
>> > is
>> > >>> empty.  Should look something like this:
>> > >>>
>> > >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
>> > >>> org.apache.mahout.math.VectorWritable
>> > >>> Key: 0: Value:
>> > >>>
>> >
>> {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
>> > >>> Key: 1: Value:
>> > >>>
>> >
>> {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
>> > >>> Key: 2: Value:
>> > >>>
>> >
>> {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
>> > >>> ...
>> > >>> ...
>> > >>>
>> > >>> Key refers to a document id and the Value are topic ids:weights
>> > assigned
>> > >>> to document id.
>> > >>>
>> > >>> So you need to figure out where things went wrong.  I assume folder
>> > >>> /user/sgeadmin/text_lda also has empty part files?  Assuming part
>> > >>> files are there, run seqdumper on one.  It should have data like the
>> > >>> above
>> > >>> except in this case the key will be a topic id and the vector will
>> be
>> > term
>> > >>> ids:weights.
>> > >>>
>> > >>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to
>> make
>> > >>> sure sparse vectors were generated for your input to cvb.
>> > >>>
>> > >>> Dan
>> > >>>
>> > >>>    *From:* Folcon Red <fo...@gmail.com>
>> > >>> *To:* DAN HELM <da...@verizon.net>
>> > >>> *Cc:* Jake Mannix <ja...@gmail.com>; "user@mahout.apache.org"
>> <
>> > >>> user@mahout.apache.org>
>> > >>> *Sent:* Sunday, July 29, 2012 3:35 PM
>> > >>>
>> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's
>> topics
>> >
>> > >>>
>> > >>> Thanks Dan and Jake,
>> > >>>
>> > >>> The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i /user/
>> > >>> sgeadmin/text_cvb_document/part-m-00000 is:
>> > >>>
>> > >>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
>> > >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
>> > >>> org.apache.mahout.math.VectorWritable
>> > >>> Count: 0
>> > >>>
>> > >>> I'm not certain what went wrong.
>> > >>>
>> > >>> Kind Regards,
>> > >>> Folcon
>> > >>>
>> > >>> On 29 July 2012 18:49, DAN HELM <da...@verizon.net> wrote:
>> > >>>
>> > >>> Folcon,
>> > >>>
>> > >>> I'm still using Mahout 0.6 so don't know much about changes in 0.7.
>> > >>>
>> > >>> Your output folder for "dt" looks correct.  The relevant data
>> would be
>> > >>> in  /user/sgeadmin/text_cvb_document/part-m-00000 which is what I
>> would
>> > >>> be passing to a "-s" option.  But I see it says size is only 97 so
>> that
>> > >>> looks suspicious.  So you can just view file (for starters) as:
>> mahout
>> > >>> seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000.  And
>> the
>> > >>> vector dumper command (as Jake pointed out) has a lot more options
>> to
>> > >>> post-process the data but you may want to first just see what is in
>> > >>> that file.
>> > >>>
>> > >>> Dan
>> > >>>
>> > >>>    *From:* Folcon Red <fo...@gmail.com>
>> > >>> *To:* Jake Mannix <ja...@gmail.com>
>> > >>> *Cc:* user@mahout.apache.org; DAN HELM <da...@verizon.net>
>> > >>> *Sent:* Sunday, July 29, 2012 1:08 PM
>> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's
>> topics
>> >
>> > >>>
>> > >>> Hi Guys,
>> > >>>
>> > >>> Thanks for replying, the problem is whenever I use any -s flag I get
>> > the
>> > >>> error "Unexpected -s while processing Job-Specific Options:"
>> > >>>
>> > >>> Also I'm not sure if this is supposed to be the output of -dt
>> > >>>
>> > >>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop
>> > >>> starcluster
>> > >>> Found 3 items
>> > >>> -rw-r--r--  3 sgeadmin supergroup          0 2012-07-29 16:51
>> /user/
>> > >>> sgeadmin/text_cvb_document/_SUCCESS
>> > >>> drwxr-xr-x  - sgeadmin supergroup          0 2012-07-29 16:50
>> /user/
>> > >>> sgeadmin/text_cvb_document/_logs
>> > >>> -rw-r--r--  3 sgeadmin supergroup        97 2012-07-29 16:51 /user/
>> > >>> sgeadmin/text_cvb_document/part-m-00000
>> > >>>
>> > >>> Should I be using a newer version of mahout? I've just been using
>> the
>> > >>> 0.7 distribution so far as apparently the compiled versions are
>> missing
>> > >>> parts that the distributed ones have.
>> > >>>
>> > >>> Kind Regards,
>> > >>> Folcon
>> > >>>
>> > >>> PS: Thanks for the help so far!
>> > >>>
>> > >>> On 29 July 2012 04:52, Jake Mannix <ja...@gmail.com> wrote:
>> > >>>
>> > >>>
>> > >>>
>> > >>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <danielhelm@verizon.net
>> > >wrote:
>> > >>>
>> > >>> Hi Folcon,
>> > >>>
>> > >>> In the folder you specified for the –dt option for cvb command
>> > >>> there should be sequence files with the document to topic
>> associations
>> > >>> (Key:
>> > >>> IntWritable, Value: VectorWritable).
>> > >>>
>> > >>>
>> > >>> Yeah, this is correct, although this:
>> > >>>
>> > >>>
>> > >>> You can dump in text format as: mahout seqdumper –s <sequence file>
>> > >>>
>> > >>>
>> > >>> is not as good as using vectordumper:
>> > >>>
>> > >>>    mahout vectordump -s <sequence file> --dictionary <path to
>> > dictionary.file-0>
>> > >>> \
>> > >>>        --dictionaryType seqfile --vectorSize <num entries per
>> topic you
>> > >>> want to see> -sort
>> > >>>
>> > >>> This joins your topic vectors with the dictionary, then picks out
>> the
>> > >>> top k terms (with their
>> > >>> probabilities) for each topic and prints them to the console (or to
>> the
>> > >>> file you specify with
>> > >>> an --output option).
>> > >>>
>> > >>> *although* I notice now that in trunk when I just checked,
>> > VectorDumper.java
>> > >>> had a bug
>> > >>> in it for "vectorSize" - line 175 asks for cmdline option "
>> > >>> numIndexesPerVector" not
>> > >>> vectorSize, ack!  So I took the liberty of fixing that, but you'll
>> need
>> > >>> to "svn up" and rebuild
>> > >>> your jar before using vectordump like this.
>> > >>>
>> > >>>
>> > >>>  So in text output from seqdumper, the key is a document id and the
>> > >>> vector contains
>> > >>> the topics and associated scores associated with the document.  I
>> think
>> > >>> all topics are listed for each
>> > >>> document but many with near zero score.
>> > >>> In my case I used rowid to convert keys of original sparse
>> > >>> document vectors from Text to Integer before running cvb and this
>> > >>> generates a mapping file so I know the textual
>> > >>> keys that correspond to the numeric document ids (since my original
>> > >>> document ids were file names and I created named vectors).
>> > >>> Hope this helps.
>> > >>> Dan
>> > >>>
>> > >>> ________________________________
>> > >>>
>> > >>>  From: Folcon <fo...@gmail.com>
>> > >>> To: user@mahout.apache.org
>> > >>> Sent: Saturday, July 28, 2012 8:28 PM
>> > >>> Subject: Using Mahout to train an CVB and retrieve it's topics
>> > >>>
>> > >>> Hi Everyone,
>> > >>>
>> > >>> I'm posting this as my original message did not seem to appear on
>> the
>> > >>> mailing
>> > >>> list, I'm very sorry if I have done this in error.
>> > >>>
>> > >>> I'm doing this to then use the topics to train a maxent algorithm
>> to
>> > >>> predict the
>> > >>> classes of documents given their topic mixtures. Any further aid in
>> > this
>> > >>> direction would be appreciated!
>> > >>>
>> > >>> I've been trying to extract the topics out of my run of cvb. Here's
>> > >>> what I did
>> > >>> so far.
>> > >>>
>> > >>> Ok, so I still don't know how to output the topics, but I have
>> worked
>> > >>> out how to
>> > >>> get the cvb and what I think are the document vectors, however I'm
>> not
>> > >>> having
>> > >>> any luck dumping them, so help here would still be appreciated!
>> > >>>
>> > >>> I set the values of:
>> > >>>    export MAHOUT_HOME=/home/sgeadmin/mahout
>> > >>>    export HADOOP_HOME=/usr/lib/hadoop
>> > >>>    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>> > >>>    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
>> > >>> on the master otherwise none of this works.
>> > >>>
>> > >>> So first I uploaded the documents using starclusters put:
>> > >>>    starcluster put mycluster text_train /home/sgeadmin/
>> > >>>    starcluster put mycluster text_test /home/sgeadmin/
>> > >>>
>> > >>> Then I added them to Hadoop's HDFS filesystem:
>> > >>>    dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop
>> > >>> starcluster
>> > >>>
>> > >>> Then I called Mahout's seqdirectory to turn the text into sequence
>> > files
>> > >>>    $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train
>> > >>> --output /user/sgeadmin/text_seq -c UTF-8 -ow
>> > >>>
>> > >>> Then I called Mahout's seq2sparse to turn them into vectors
>> > >>>    $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec
>> > >>> -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>> > >>>
>> > >>> Finally I called cvb, I believe that the -dt flag states where the
>> > >>> inferred
>> > >>> topics should go, but because I haven't yet been able to dump them I
>> > >>> can't
>> > >>> confirm this.
>> > >>>    $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors -o
>> > >>> /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict
>> > >>> /user/sgeadmin/text_vec/dictionary.file-0 -dt /user/sgeadmin/text_cvb_document
>> > >>> -mt /user/sgeadmin/text_states
>> > >>>
>> > >>> The -k flag is the number of topics, the -nt flag is the size of
>> the
>> > >>> dictionary,
>> > >>> I computed this by counting the number of entries of the
>> > >>> dictionary.file-0
>> > >>> inside the vectors(in this case under /user/sgeadmin/text_vec) and
>> -x
>> > >>> is the
>> > >>> number of iterations.
>> > >>>
>> > >>> If you know how to get what the document topic probabilities are
>> from
>> > >>> here, help
>> > >>> would be most appreciated!
>> > >>>
>> > >>> Kind Regards,
>> > >>> Folcon
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>> --
>> > >>>
>> > >>>  -jake
>> > >>>

Re: Using Mahout to train an CVB and retrieve it's topics

Posted by Folcon Red <fo...@gmail.com>.
Hi Dan,

I've managed to get the text_seq and text_vec generated properly, however
when I run:

$MAHOUT_HOME/bin/mahout cvb -i /user/root/text_vec/tf-vectors -o
/user/root/text_lda -k 100 -nt 29536 -x 20 -dict
/user/root/text_vec/dictionary.file-0 -dt /user/root/text_cvb_document
-mt /user/root/text_states

I get:

12/08/05 21:18:04 INFO mapred.JobClient: Task Id :
attempt_201208051752_0002_m_000003_1, Status : FAILED
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
org.apache.hadoop.io.IntWritable
at
org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)

Task attempt_201208051752_0002_m_000003_1 failed to report status for 600
seconds. Killing!

Any ideas what's causing this?

Thank you for all the help so far!

Kind Regards,
Folcon
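For context on the exception above: CVB's mapper reads IntWritable document ids, while seq2sparse writes Text keys (typically file paths). Mahout's rowid job bridges the two by re-keying the vectors with sequential integers and keeping a side mapping (the docIndex) back to the original keys. The snippet below is a hedged, Mahout-free sketch of that idea only, using plain Python structures in place of Hadoop writables.

```python
def rekey_with_int_ids(text_keyed_vectors):
    """Re-key (text key, vector) pairs with sequential int ids.

    Mimics what Mahout's rowid job does conceptually: CVB needs integer
    keys, so each original text key (e.g. a file path) is replaced by a
    row number, and a docIndex mapping from row number back to the
    original key is kept on the side.
    """
    matrix = []     # (int id, vector) pairs -- what cvb would consume
    doc_index = {}  # int id -> original text key -- the mapping file
    for row, (text_key, vector) in enumerate(text_keyed_vectors):
        matrix.append((row, vector))
        doc_index[row] = text_key
    return matrix, doc_index
```

In the real pipeline, the equivalent of matrix is what gets passed to cvb's -i option, and the mapping is what lets you relate the -dt document/topic output back to file names.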

On 2 August 2012 02:41, Folcon Red <fo...@gmail.com> wrote:

> Thanks Dan,
>
> Ok, now for some strange reason it appears to be working (seq and vec
> appear to have values now; I'll test the complete cvb run later, I should
> head to bed...). The only things I think I changed were that I stopped
> using absolute paths (referring to text_seq as opposed to
> /user/root/text_seq) and that I'm now using root instead of sgeadmin.
>
> Regards,
> Folcon
>
>
> On 1 August 2012 03:00, DAN HELM <da...@verizon.net> wrote:
>
>> Hi Folcon,
>>
>> There is no reason to rerun seq2sparse, as it is clear something is wrong
>> with the text files being processed by the seqdirectory command.
>>
>> Based on the keys, I'm assuming the full paths to the input files look
>> like /high/59734, etc.  Did you look inside the files to make
>> sure there is text in them?
>>
>> As a test, just create a folder with a simple text file and run that
>> through seqdirectory, and I'll bet you will then see output from the
>> seqdumper command (run on the seqdirectory output).
>>
>> Thanks, Dan
>>
>>    *From:* Folcon Red <fo...@gmail.com>
>> *To:* DAN HELM <da...@verizon.net>
>> *Cc:* "user@mahout.apache.org" <us...@mahout.apache.org>
>> *Sent:* Tuesday, July 31, 2012 7:28 PM
>>
>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>>
>> Hi Dan,
>>
>> It's good to know that seqdirectory reads files in subfolders and I've
>> dumped out some of the values in the hopes that they will be
>> enlightening, The values seem to be missing for both the text_seq and
>> the tokenized-documents.
>>
>> So rerunning some of the commands:
>> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train
>> --output /user/sgeadmin/text_seq -c UTF-8 -ow
>> $MAHOUT_HOME/bin/mahout seq2sparse -i /user/sgeadmin/text_seq -o
>> /user/sgeadmin/text_vec -wt tf -a
>> org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>
>> And then doing a seqdumper of text_seq:
>> SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
>> [...]
>> Key: /high/59734: Value:
>> Key: /high/264596: Value:
>> Key: /high/341699: Value:
>> Key: /high/260770: Value:
>> Key: /high/222320: Value:
>> Key: /high/198156: Value:
>> Key: /high/326011: Value:
>> Key: /high/112050: Value:
>> Key: /high/306887: Value:
>> Key: /high/208169: Value:
>> Key: /high/283464: Value:
>> Key: /high/168905: Value:
>> Count: 2548
>>
>> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout seqdumper -i
>> /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
>> HADOOP_CONF_DIR=/conf
>> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> 12/07/31 23:23:34 INFO common.AbstractJob: Command line arguments:
>> {--endPhase=[2147483647],
>> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
>> --startPhase=[0], --tempDir=[temp]}
>> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> Key class: class org.apache.hadoop.io.Text Value Class: class
>> org.apache.mahout.math.VectorWritable
>> Count: 0
>>
>> $MAHOUT_HOME/bin/mahout seqdumper -i
>> /user/sgeadmin/text_vec/tokenized-documents/part-m-00000
>> [...]
>> Key: /high/396063: Value: []
>> Key: /high/230246: Value: []
>> Key: /high/136284: Value: []
>> Key: /high/59734: Value: []
>> Key: /high/264596: Value: []
>> Key: /high/341699: Value: []
>> Key: /high/260770: Value: []
>> Key: /high/222320: Value: []
>> Key: /high/198156: Value: []
>> Key: /high/326011: Value: []
>> Key: /high/112050: Value: []
>> Key: /high/306887: Value: []
>> Key: /high/208169: Value: []
>> Key: /high/283464: Value: []
>> Key: /high/168905: Value: []
>> Count: 2548
>>
>>
>> Running vectordump on the text_vec folder like so:
>> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout vectordump -i
>> /user/sgeadmin/text_vec
>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
>> HADOOP_CONF_DIR=/conf
>> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> 12/07/31 23:21:08 INFO common.AbstractJob: Command line arguments:
>> {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec],
>> --startPhase=[0], --tempDir=[temp]}
>> 12/07/31 23:21:08 INFO vectors.VectorDumper: Sort? false
>> Exception in thread "main" java.lang.IllegalStateException:
>> file:/user/sgeadmin/text_vec/tf-vectors
>> at
>>
>> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
>> at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:194)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at
>> org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:616)
>> at
>>
>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:616)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>> Caused by: java.io.FileNotFoundException:
>> /user/sgeadmin/harry_old_mallet_vec/tf-vectors (Is a directory)
>> at java.io.FileInputStream.open(Native Method)
>> at java.io.FileInputStream.<init>(FileInputStream.java:137)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(
>> RawLocalFileSystem.java:72)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(
>> RawLocalFileSystem.java:108)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:178)
>> at
>> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(
>> ChecksumFileSystem.java:127)
>> at
>> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
>> at
>> org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1452)
>> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile
>> .java:1431)
>> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile
>> .java:1424)
>> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile
>> .java:1419)
>> at
>> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init
>> >(SequenceFileIterator.java:58)
>> at
>>
>> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
>> ... 15 more
>>
>> Kind Regards,
>> Nilu
>>
>> On 31 July 2012 23:59, DAN HELM <da...@verizon.net> wrote:
>>
>> > Folcon,
>> >
>> > seqdirectory should also read files in subfolders.
>> >
>> > Did you verify that the recent seqdirectory command did in fact generate
>> > non-empty sequence files?  I believe the seqdirectory command just assumes
>> > each file contains a single document (no concatenated documents per
>> > file), and that each file contains basic text.
>> >
>> > If it did generate sequence files this time, I assume your folder
>> > "/user/sgeadmin/text_seq" was copied to hdfs (if not already there)
>> before
>> > you ran seq2sparse on it?
>> >
>> > Dan
>> >
>> >    *From:* Folcon Red <fo...@gmail.com>
>> > *To:* DAN HELM <da...@verizon.net>
>> > *Cc:* Jake Mannix <ja...@gmail.com>; "user@mahout.apache.org" <
>> > user@mahout.apache.org>
>> > *Sent:* Tuesday, July 31, 2012 1:34 PM
>>
>> >
>> > *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>> >
>> > So part-r-00000 inside text_vec is
>> > still SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable
>> > even after moving all the training files into a single folder.
>> >
>> > Regards,
>> > Folcon
>> >
>> > On 31 July 2012 18:18, Folcon Red <fo...@gmail.com> wrote:
>> >
>> > > Hey Everyone,
>> > >
>> > > Ok not certain why  $MAHOUT_HOME/bin/mahout seqdirectory --input
>> /user/
>> > > sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
>> didn't
>> >
>> > > produce sequence files, just looking inside text_seq only gives me:
>> > >
>> > > SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
>> > >
>> > > and that's it. Any ideas what I've been doing wrong? Maybe it's
>> because I
>> > > have the files nested in the folder by class, for example a tree view
>> of
>> > > the directory would look like.
>> > >
>> > > text_train -+
>> > >                | A -+
>> > >                        | 100
>> > >                        | 101
>> > >                        | 103
>> > >                | B -+
>> > >                        | 102
>> > >                        | 105
>> > >                        | 106
>> > >
>> > > So it's not picking them up? Or perhaps something else? I'm going to
>> try
>> > > some variations to see what happens.
>> > >
>> > > Thanks for the help so far!
>> > >
>> > > Regards,
>> > > Folcon
>> > >
>> > >
>> > > On 29 July 2012 22:10, Folcon Red <fo...@gmail.com> wrote:
>> > >
>> > >> Right, well here's something promising, running
>> $MAHOUT_HOME/bin/mahout
>> > >> seqdumper -i /user/sgeadmin/text_lda/part-m-00000 produced:
>> > >>
>> > >>
>> > >>
>> > 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN
>> ,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN
>> ,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN
>> ,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN
>> ,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN
>> ,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN
>> ,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN
>> ,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN
>> ,29533:NaN,29534:NaN,29535:NaN}
>> > >>
>> > >> And $MAHOUT_HOME/bin/mahout seqdumper -i
>> > >> /user/sgeadmin/text_vec/tf-vectors/part-r-00000 produced:
>> > >>
>> > >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>> > >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
>> > >> HADOOP_CONF_DIR=/usr/lib/hadoop/conf
>> > >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
>> > >> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments:
>> > >> {--endPhase=[2147483647],
>> > >> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
>> > >> --startPhase=[0], --tempDir=[temp]}
>> > >> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
>> > >> Key class: class org.apache.hadoop.io.Text Value Class: class
>> > >> org.apache.mahout.math.VectorWritable
>> > >> Count: 0
>> > >>
>> > >> Kind Regards,
>> > >> Folcon
>> > >>
>> > >> On 29 July 2012 21:29, DAN HELM <da...@verizon.net> wrote:
>> > >>
>> > >>> Yep something went wrong, most likely with the clustering.  part
>> file
>> > is
>> > >>> empty.  Should look something like this:
>> > >>>
>> > >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
>> > >>> org.apache.mahout.math.VectorWritable
>> > >>> Key: 0: Value:
>> > >>>
>> >
>> {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
>> > >>> Key: 1: Value:
>> > >>>
>> >
>> {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
>> > >>> Key: 2: Value:
>> > >>>
>> >
>> {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
>> > >>> ...
>> > >>> ...
>> > >>>
>> > >>> The Key refers to a document id, and the Value is topic id:weight pairs
>> > assigned
>> > >>> to document id.
>> > >>>
>> > >>> So you need to figure out where things went wrong.  I assume the
>> folder
>> > >>> /user/sgeadmin/text_lda also has empty part files?  If part
>> > >>> files are there, run seqdumper on one.  It should have data like the
>> above
>> > >>> except in this case the key will be a topic id and the vector will
>> be
>> > term
>> > >>> ids:weights.
>> > >>>
>> > >>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to
>> make
>> > >>> sure sparse vectors were generated for your input to cvb.
>> > >>>
>> > >>> Dan
>> > >>>
>> > >>>    *From:* Folcon Red <fo...@gmail.com>
>> > >>> *To:* DAN HELM <da...@verizon.net>
>> > >>> *Cc:* Jake Mannix <ja...@gmail.com>; "user@mahout.apache.org"
>> <
>> > >>> user@mahout.apache.org>
>> > >>> *Sent:* Sunday, July 29, 2012 3:35 PM
>> > >>>
>> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's
>> topics
>> >
>> > >>>
>> > >>> Thanks Dan and Jake,
>> > >>>
>> > >>> The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i /user/
>> > >>> sgeadmin/text_cvb_document/part-m-00000 is:
>> > >>>
>> > >>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
>> > >>> Key class: class org.apache.hadoop.io <
>> > http://org.apache.hadoop.io.int/>
>> >
>> > >>> .IntWritable Value Class: class
>> org.apache.mahout.math.VectorWritable
>> > >>> Count: 0
>> > >>>
>> > >>> I'm not certain what went wrong.
>> > >>>
>> > >>> Kind Regards,
>> > >>> Folcon
>> > >>>
>> > >>> On 29 July 2012 18:49, DAN HELM <da...@verizon.net> wrote:
>> > >>>
>> > >>> Folcon,
>> > >>>
>> > >>> I'm still using Mahout 0.6 so don't know much about changes in 0.7.
>> > >>>
>> > >>> Your output folder for "dt" looks correct.  The relevant data
>> would be
>> > >>> in  /user/sgeadmin/text_cvb_document/part-m-00000 which is what I
>> would
>> > >>> be passing to a "-s" option.  But I see it says size is only 97 so
>> that
>> > >>> looks suspicious.  So you can just view file (for starters) as:
>> mahout
>> > >>> seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000.  And
>> the
>> > >>> vector dumper command (as Jake pointed out) has a lot more options
>> to
>> > >>> post-process the data but you may want to first just see what is in
>> > >>> that file.
>> > >>>
>> > >>> Dan
>> > >>>
>> > >>>    *From:* Folcon Red <fo...@gmail.com>
>> > >>> *To:* Jake Mannix <ja...@gmail.com>
>> > >>> *Cc:* user@mahout.apache.org; DAN HELM <da...@verizon.net>
>> > >>> *Sent:* Sunday, July 29, 2012 1:08 PM
>> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's
>> topics
>> >
>> > >>>
>> > >>> Hi Guys,
>> > >>>
>> > >>> Thanks for replying, the problem is whenever I use any -s flag I get
>> > the
>> > >>> error "Unexpected -s while processing Job-Specific Options:"
>> > >>>
>> > >>> Also I'm not sure if this is supposed to be the output of -dt
>> > >>>
>> > >>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -
>> hadoop
>> > >>> starcluster
>> > >>> Found 3 items
>> > >>> -rw-r--r--  3 sgeadmin supergroup          0 2012-07-29 16:51
>> /user/
>> > >>> sgeadmin/text_cvb_document/_SUCCESS
>> > >>> drwxr-xr-x  - sgeadmin supergroup          0 2012-07-29 16:50
>> /user/
>> > >>> sgeadmin/text_cvb_document/_logs
>> > >>> -rw-r--r--  3 sgeadmin supergroup        97 2012-07-29 16:51 /user/
>> > >>> sgeadmin/text_cvb_document/part-m-00000
>> > >>>
>> > >>> Should I be using a newer version of mahout? I've just been using
>> the
>> > >>> 0.7 distribution so far as apparently the compiled versions are
>> missing
>> > >>> parts that the distributed ones have.
>> > >>>
>> > >>> Kind Regards,
>> > >>> Folcon
>> > >>>
>> > >>> PS: Thanks for the help so far!
>> > >>>
>> > >>> On 29 July 2012 04:52, Jake Mannix <ja...@gmail.com> wrote:
>> > >>>
>> > >>>
>> > >>>
>> > >>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <danielhelm@verizon.net
>> > >wrote:
>> > >>>
>> > >>> Hi Folcon,
>> > >>>
>> > >>> In the folder you specified for the –dt option for cvb command
>> > >>> there should be sequence files with the document to topic
>> associations
>> > >>> (Key:
>> > >>> IntWritable, Value: VectorWritable).
>> > >>>
>> > >>>
>> > >>> Yeah, this is correct, although this:
>> > >>>
>> > >>>
>> > >>> You can dump in text format as: mahout seqdumper –s <sequence file>
>> > >>>
>> > >>>
>> > >>> is not as good as using vectordumper:
>> > >>>
>> > >>>    mahout vectordump -s <sequence file> --dictionary <path to
>> > dictionary.file-0>
>> > >>> \
>> > >>>        --dictionaryType seqfile --vectorSize <num entries per
>> topic you
>> > >>> want to see> -sort
>> > >>>
>> > >>> This joins your topic vectors with the dictionary, then picks out
>> the
>> > >>> top k terms (with their
>> > >>> probabilities) for each topic and prints them to the console (or to
>> the
>> > >>> file you specify with
>> > >>> an --output option).
>> > >>>
>> > >>> *although* I notice now that in trunk when I just checked,
>> > VectorDumper.java
>> > >>> had a bug
>> > >>> in it for "vectorSize" - line 175 asks for cmdline option "
>> > >>> numIndexesPerVector" not
>> > >>> vectorSize, ack!  So I took the liberty of fixing that, but you'll
>> need
>> > >>> to "svn up" and rebuild
>> > >>> your jar before using vectordump like this.
>> > >>>
>> > >>>
>> > >>>  So in text output from seqdumper, the key is a document id and the
>> > >>> vector contains
>> > >>> the topics and the scores associated with the document.  I
>> think
>> > >>> all topics are listed for each
>> > >>> document but many with near zero score.
>> > >>> In my case I used rowid to convert keys of original sparse
>> > >>> document vectors from Text to Integer before running cvb and this
>> > >>> generates a mapping file so I know the textual
>> > >>> keys that correspond to the numeric document ids (since my original
>> > >>> document ids were file names and I created named vectors).
>> > >>> Hope this helps.
>> > >>> Dan
>> > >>>
>> > >>> ________________________________
>> > >>>
>> > >>>  From: Folcon <fo...@gmail.com>
>> > >>> To: user@mahout.apache.org
>> > >>> Sent: Saturday, July 28, 2012 8:28 PM
>> > >>> Subject: Using Mahout to train an CVB and retrieve it's topics
>> > >>>
>> > >>> Hi Everyone,
>> > >>>
>> > >>> I'm posting this as my original message did not seem to appear on
>> the
>> > >>> mailing
>> > >>> list, I'm very sorry if I have done this in error.
>> > >>>
>> > >>> I'm doing this to then use the topics to train a maxent algorithm
>> to
>> > >>> predict the
>> > >>> classes of documents given their topic mixtures. Any further aid in
>> > this
>> > >>> direction would be appreciated!
>> > >>>
>> > >>> I've been trying to extract the topics out of my run of cvb. Here's
>> > >>> what I did
>> > >>> so far.
>> > >>>
>> > >>> Ok, so I still don't know how to output the topics, but I have
>> worked
>> > >>> out how to
>> > >>> get the cvb and what I think are the document vectors, however I'm
>> not
>> > >>> having
>> > >>> any luck dumping them, so help here would still be appreciated!
>> > >>>
>> > >>> I set the values of:
>> > >>>    export MAHOUT_HOME=/home/sgeadmin/mahout
>> > >>>    export HADOOP_HOME=/usr/lib/hadoop
>> > >>>    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
>> > >>>    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
>> > >>> on the master otherwise none of this works.
>> > >>>
>> > >>> So first I uploaded the documents using starclusters put:
>> > >>>    starcluster put mycluster text_train /home/sgeadmin/
>> > >>>    starcluster put mycluster text_test /home/sgeadmin/
>> > >>>
>> > >>> Then I added them to hadoop's hbase filesystem:
>> > >>>    dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop
>> > >>> starcluster
>> > >>>
>> > >>> Then I called Mahout's seqdirectory to turn the text into sequence
>> > files
>> > >>>    $MAHOUT_HOME/bin/mahout seqdirectory --input
>> > /user/sgeadmin/text_train
>> > >>> --
>> > >>> output /user/sgeadmin/text_seq -c UTF-8 -ow
>> > >>>
>> > >>> Then I called Mahout's seq2parse to turn them into vectors
>> > >>>    $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin
>> > >>> /text_vec -
>> > >>> wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>> > >>>
>> > >>> Finally I called cvb, I believe that the -dt flag states where the
>> > >>> inferred
>> > >>> topics should go, but because I haven't yet been able to dump them I
>> > >>> can't
>> > >>> confirm this.
>> > >>>    $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors
>> -o
>> > >>> /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict
>> > >>> /user/sgeadmin/text_vec/dictionary.file-0 -dt
>> > /user/sgeadmin/text_cvb_document
>> > >>> -
>> > >>> mt /user/sgeadmin/text_states
>> > >>>
>> > >>> The -k flag is the number of topics, the -nt flag is the size of
>> the
>> > >>> dictionary,
>> > >>> I computed this by counting the number of entries of the
>> > >>> dictionary.file-0
>> > >>> inside the vectors(in this case under /user/sgeadmin/text_vec) and
>> -x
>> > >>> is the
>> > >>> number of iterations.
>> > >>>
>> > >>> If you know how to get what the document topic probabilities are
>> from
>> > >>> here, help
>> > >>> would be most appreciated!
>> > >>>
>> > >>> Kind Regards,
>> > >>> Folcon
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>> --
>> > >>>
>> > >>>  -jake
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>
>> > >>
>> > >
>> >
>> >
>> >
>> >
>>
>>
>>
>>
>
>

Re: Using Mahout to train an CVB and retrieve it's topics

Posted by Folcon Red <fo...@gmail.com>.
Thanks Dan,

Ok, for some strange reason it now appears to be working (the seq and vec
outputs have values; I will test the complete cvb run later, as I should
head to bed...). The only things I think I changed were that I stopped
using absolute paths (referring to text_seq as opposed to
/user/root/text_seq) and that I'm now running as root instead of sgeadmin.

Regards,
Folcon
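
[Editor's note: a side note on the header-only part files seen earlier in
the thread. The "SEQ org.apache.hadoop.io.Text ..." string that cat shows
is just the SequenceFile header: a 3-byte magic "SEQ", a version byte, then
the key and value class names as length-prefixed strings. A part file of
only ~100 bytes is therefore a structurally valid sequence file with zero
records, which matches the Count: 0 reported by seqdumper. A minimal sketch
that fabricates such a header-only file (simplified: it omits the
compression flags, metadata, and sync marker a real header also carries):]

```shell
# Fabricate a header-only SequenceFile, as seqdirectory produces when it
# finds no non-empty documents. \006 is the version byte; \031 (25) is the
# single-byte length prefix for "org.apache.hadoop.io.Text" (25 characters).
printf 'SEQ\006\031org.apache.hadoop.io.Text\031org.apache.hadoop.io.Text' \
  > part-r-00000
head -c 3 part-r-00000; echo         # the magic bytes: SEQ
wc -c < part-r-00000                 # 56 bytes -- header only, no records
```

[So a part file whose only readable content is "SEQ ...Text ...Text" means
the job ran but wrote no records; the input to it was effectively empty.]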

On 1 August 2012 03:00, DAN HELM <da...@verizon.net> wrote:

> Hi Folcon,
>
> There is no reason to rerun seq2sparse, as it is clear something is wrong
> with the text files being processed by the seqdirectory command.
>
> Based on the keys, I'm assuming the full paths of the input files are
> names like /high/59734, etc.  Did you look inside the files to make
> sure there is text in them?
>
> As a test, just create a folder with a simple text file and run that
> through seqdirectory, and I'll bet you will then see output from the
> seqdumper command (run on the seqdirectory output).
>
> Thanks, Dan
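
[Editor's note: Dan's sanity check can be scripted. A minimal sketch, using
a fabricated corpus that mirrors the hypothetical text_train/<class>/<docid>
layout from earlier in the thread; any file that find flags as empty would
come through seqdirectory as an empty Value: in the sequence file:]

```shell
# Build a tiny example corpus (hypothetical layout mirroring text_train).
mkdir -p text_train/A text_train/B
printf 'some document text\n' > text_train/A/100
: > text_train/B/102                 # a deliberately empty document
# Any path printed here would show up as "Key: /B/102: Value:" (empty)
# in the seqdumper output of the generated sequence files.
find text_train -type f -empty -print
```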
>
>    *From:* Folcon Red <fo...@gmail.com>
> *To:* DAN HELM <da...@verizon.net>
> *Cc:* "user@mahout.apache.org" <us...@mahout.apache.org>
> *Sent:* Tuesday, July 31, 2012 7:28 PM
>
> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>
> Hi Dan,
>
> It's good to know that seqdirectory reads files in subfolders and I've
> dumped out some of the values in the hopes that they will be
> enlightening. The values seem to be missing for both the text_seq and
> the tokenized-documents.
>
> So rerunning some of the commands:
> $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train
> --output /user/sgeadmin/text_seq -c UTF-8 -ow
> $MAHOUT_HOME/bin/mahout seq2sparse -i /user/sgeadmin/text_seq -o
> /user/sgeadmin/text_vec -wt tf -a
> org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>
> And then doing a seqdumper of text_seq:
> SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
> [...]
> Key: /high/59734: Value:
> Key: /high/264596: Value:
> Key: /high/341699: Value:
> Key: /high/260770: Value:
> Key: /high/222320: Value:
> Key: /high/198156: Value:
> Key: /high/326011: Value:
> Key: /high/112050: Value:
> Key: /high/306887: Value:
> Key: /high/208169: Value:
> Key: /high/283464: Value:
> Key: /high/168905: Value:
> Count: 2548
>
> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout seqdumper -i
> /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> HADOOP_CONF_DIR=/conf
> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> 12/07/31 23:23:34 INFO common.AbstractJob: Command line arguments:
> {--endPhase=[2147483647],
> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
> --startPhase=[0], --tempDir=[temp]}
> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> Key class: class org.apache.hadoop.io.Text Value Class: class
> org.apache.mahout.math.VectorWritable
> Count: 0
>
> $MAHOUT_HOME/bin/mahout seqdumper -i
> /user/sgeadmin/text_vec/tokenized-documents/part-m-00000
> [...]
> Key: /high/396063: Value: []
> Key: /high/230246: Value: []
> Key: /high/136284: Value: []
> Key: /high/59734: Value: []
> Key: /high/264596: Value: []
> Key: /high/341699: Value: []
> Key: /high/260770: Value: []
> Key: /high/222320: Value: []
> Key: /high/198156: Value: []
> Key: /high/326011: Value: []
> Key: /high/112050: Value: []
> Key: /high/306887: Value: []
> Key: /high/208169: Value: []
> Key: /high/283464: Value: []
> Key: /high/168905: Value: []
> Count: 2548
>
>
> Running vectordump on the text_vec folder like so:
> root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout vectordump -i
> /user/sgeadmin/text_vec
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> HADOOP_CONF_DIR=/conf
> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> 12/07/31 23:21:08 INFO common.AbstractJob: Command line arguments:
> {--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec],
> --startPhase=[0], --tempDir=[temp]}
> 12/07/31 23:21:08 INFO vectors.VectorDumper: Sort? false
> Exception in thread "main" java.lang.IllegalStateException:
> file:/user/sgeadmin/text_vec/tf-vectors
> at
>
> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
> at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:194)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616)
> at
>
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Caused by: java.io.FileNotFoundException:
> /user/sgeadmin/harry_old_mallet_vec/tf-vectors (Is a directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.<init>(FileInputStream.java:137)
> at
> org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(
> RawLocalFileSystem.java:72)
> at
> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(
> RawLocalFileSystem.java:108)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:178)
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(
> ChecksumFileSystem.java:127)
> at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
> at
> org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1452)
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419)
> at
> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init
> >(SequenceFileIterator.java:58)
> at
>
> org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
> ... 15 more
>
> Kind Regards,
> Nilu
>
> On 31 July 2012 23:59, DAN HELM <da...@verizon.net> wrote:
>
> > Folcon,
> >
> > seqdirectory should also read files in subfolders.
> >
> > Did you verify that the recent seqdirectory command did in fact generate
> > non-empty sequence files?  I believe the seqdirectory command just assumes
> > each file contains a single document (no concatenated documents per
> > file), and that each file contains basic text.
> >
> > If it did generate sequence files this time, I assume your folder
> > "/user/sgeadmin/text_seq" was copied to hdfs (if not already there)
> before
> > you ran seq2sparse on it?
> >
> > Dan
> >
> >    *From:* Folcon Red <fo...@gmail.com>
> > *To:* DAN HELM <da...@verizon.net>
> > *Cc:* Jake Mannix <ja...@gmail.com>; "user@mahout.apache.org" <
> > user@mahout.apache.org>
> > *Sent:* Tuesday, July 31, 2012 1:34 PM
>
> >
> > *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
> >
> > So part-r-00000 inside text_vec is
> > still SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable
> > even after moving all the training files into a single folder.
> >
> > Regards,
> > Folcon
> >
> > On 31 July 2012 18:18, Folcon Red <fo...@gmail.com> wrote:
> >
> > > Hey Everyone,
> > >
> > > Ok not certain why  $MAHOUT_HOME/bin/mahout seqdirectory --input
> /user/
> > > sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow
> didn't
> >
> > > produce sequence files, just looking inside text_seq only gives me:
> > >
> > > SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
> > >
> > > and that's it. Any ideas what I've been doing wrong? Maybe it's
> because I
> > > have the files nested in the folder by class, for example a tree view
> of
> > > the directory would look like.
> > >
> > > text_train -+
> > >                | A -+
> > >                        | 100
> > >                        | 101
> > >                        | 103
> > >                | B -+
> > >                        | 102
> > >                        | 105
> > >                        | 106
> > >
> > > So it's not picking them up? Or perhaps something else? I'm going to
> try
> > > some variations to see what happens.
> > >
> > > Thanks for the help so far!
> > >
> > > Regards,
> > > Folcon
> > >
> > >
> > > On 29 July 2012 22:10, Folcon Red <fo...@gmail.com> wrote:
> > >
> > >> Right, well here's something promising, running
> $MAHOUT_HOME/bin/mahout
> > >> seqdumper -i /user/sgeadmin/text_lda/part-m-00000 produced:
> > >>
> > >>
> > >>
> > 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN,29484:
> NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN,29491:NaN
> ,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN,29498:NaN
> ,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN,29505:NaN
> ,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN,29512:NaN
> ,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN,29519:NaN
> ,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN,29526:NaN
> ,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN,29533:NaN
> ,29534:NaN,29535:NaN}
> > >>
> > >> And $MAHOUT_HOME/bin/mahout seqdumper -i
> > >> /user/sgeadmin/text_vec/tf-vectors/part-r-00000 produced:
> > >>
> > >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> > >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> > >> HADOOP_CONF_DIR=/usr/lib/hadoop/conf
> > >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> > >> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments:
> > >> {--endPhase=[2147483647],
> > >> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
> > >> --startPhase=[0], --tempDir=[temp]}
> > >> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> > >> Key class: class org.apache.hadoop.io.Text Value Class: class
> > >> org.apache.mahout.math.VectorWritable
> > >> Count: 0
> > >>
> > >> Kind Regards,
> > >> Folcon
> > >>
> > >> On 29 July 2012 21:29, DAN HELM <da...@verizon.net> wrote:
> > >>
> > >>> Yep something went wrong, most likely with the clustering.  part file
> > is
> > >>> empty.  Should look something like this:
> > >>>
> > >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> > >>> org.apache.mahout.math.VectorWritable
> > >>> Key: 0: Value:
> > >>>
> >
> {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
> > >>> Key: 1: Value:
> > >>>
> >
> {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
> > >>> Key: 2: Value:
> > >>>
> >
> {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
> > >>> ...
> > >>> ...
> > >>>
> > >>> The Key refers to a document id, and the Value holds the topic
> > >>> id:weight pairs assigned to that document id.
> > >>>
> > >>> So you need to figure out where things went wrong.  I assume the
> > >>> folder /user/sgeadmin/text_lda also has empty part files?  If part
> > >>> files are there, run seqdumper on one.  It should have data like the
> above
> > >>> except in this case the key will be a topic id and the vector will be
> > term
> > >>> ids:weights.
> > >>>
> > >>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to make
> > >>> sure sparse vectors were generated for your input to cvb.
> > >>>
> > >>> Dan
> > >>>
> > >>>    *From:* Folcon Red <fo...@gmail.com>
> > >>> *To:* DAN HELM <da...@verizon.net>
> > >>> *Cc:* Jake Mannix <ja...@gmail.com>; "user@mahout.apache.org"
> <
> > >>> user@mahout.apache.org>
> > >>> *Sent:* Sunday, July 29, 2012 3:35 PM
> > >>>
> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
> >
> > >>>
> > >>> Thanks Dan and Jake,
> > >>>
> > >>> The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i /user/
> > >>> sgeadmin/text_cvb_document/part-m-00000 is:
> > >>>
> > >>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
> > >>> Key class: class org.apache.hadoop.io <
> > http://org.apache.hadoop.io.int/>
> >
> > >>> .IntWritable Value Class: class
> org.apache.mahout.math.VectorWritable
> > >>> Count: 0
> > >>>
> > >>> I'm not certain what went wrong.
> > >>>
> > >>> Kind Regards,
> > >>> Folcon
> > >>>
> > >>> On 29 July 2012 18:49, DAN HELM <da...@verizon.net> wrote:
> > >>>
> > >>> Folcon,
> > >>>
> > >>> I'm still using Mahout 0.6 so don't know much about changes in 0.7.
> > >>>
> > >>> Your output folder for "dt" looks correct.  The relevant data would
> be
> > >>> in  /user/sgeadmin/text_cvb_document/part-m-00000 which is what I
> would
> > >>> be passing to a "-s" option.  But I see it says size is only 97 so
> that
> > >>> looks suspicious.  So you can just view file (for starters) as:
> mahout
> > >>> seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000.  And the
> > >>> vector dumper command (as Jake pointed out) has a lot more options to
> > >>> post-process the data but you may want to first just see what is in
> > >>> that file.
> > >>>
> > >>> Dan
> > >>>
> > >>>    *From:* Folcon Red <fo...@gmail.com>
> > >>> *To:* Jake Mannix <ja...@gmail.com>
> > >>> *Cc:* user@mahout.apache.org; DAN HELM <da...@verizon.net>
> > >>> *Sent:* Sunday, July 29, 2012 1:08 PM
> > >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
> >
> > >>>
> > >>> Hi Guys,
> > >>>
> > >>> Thanks for replying, the problem is whenever I use any -s flag I get
> > the
> > >>> error "Unexpected -s while processing Job-Specific Options:"
> > >>>
> > >>> Also I'm not sure if this is supposed to be the output of -dt
> > >>>
> > >>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop
> > >>> starcluster
> > >>> Found 3 items
> > >>> -rw-r--r--  3 sgeadmin supergroup          0 2012-07-29 16:51 /user/
> > >>> sgeadmin/text_cvb_document/_SUCCESS
> > >>> drwxr-xr-x  - sgeadmin supergroup          0 2012-07-29 16:50 /user/
> > >>> sgeadmin/text_cvb_document/_logs
> > >>> -rw-r--r--  3 sgeadmin supergroup        97 2012-07-29 16:51 /user/
> > >>> sgeadmin/text_cvb_document/part-m-00000
> > >>>
> > >>> Should I be using a newer version of mahout? I've just been using the
> > >>> 0.7 distribution so far as apparently the compiled versions are
> missing
> > >>> parts that the distributed ones have.
> > >>>
> > >>> Kind Regards,
> > >>> Folcon
> > >>>
> > >>> PS: Thanks for the help so far!
> > >>>
> > >>> On 29 July 2012 04:52, Jake Mannix <ja...@gmail.com> wrote:
> > >>>
> > >>>
> > >>>
> > >>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <danielhelm@verizon.net
> > >wrote:
> > >>>
> > >>> Hi Folcon,
> > >>>
> > >>> In the folder you specified for the –dt option for cvb command
> > >>> there should be sequence files with the document to topic
> associations
> > >>> (Key:
> > >>> IntWritable, Value: VectorWritable).
> > >>>
> > >>>
> > >>> Yeah, this is correct, although this:
> > >>>
> > >>>
> > >>> You can dump in text format as: mahout seqdumper –s <sequence file>
> > >>>
> > >>>
> > >>> is not as good as using vectordumper:
> > >>>
> > >>>    mahout vectordump -s <sequence file> --dictionary <path to
> > dictionary.file-0>
> > >>> \
> > >>>        --dictionaryType seqfile --vectorSize <num entries per topic
> you
> > >>> want to see> -sort
> > >>>
> > >>> This joins your topic vectors with the dictionary, then picks out the
> > >>> top k terms (with their
> > >>> probabilities) for each topic and prints them to the console (or to
> the
> > >>> file you specify with
> > >>> an --output option).
> > >>>
> > >>> *although* I notice now that in trunk when I just checked,
> > VectorDumper.java
> > >>> had a bug
> > >>> in it for "vectorSize" - line 175 asks for cmdline option "
> > >>> numIndexesPerVector" not
> > >>> vectorSize, ack!  So I took the liberty of fixing that, but you'll
> need
> > >>> to "svn up" and rebuild
> > >>> your jar before using vectordump like this.
> > >>>
> > >>>
> > >>>  So in text output from seqdumper, the key is a document id and the
> > >>> vector contains
> > >>> the topics and associated scores associated with the document.  I
> think
> > >>> all topics are listed for each
> > >>> document but many with near zero score.
> > >>> In my case I used rowid to convert keys of original sparse
> > >>> document vectors from Text to Integer before running cvb and this
> > >>> generates a mapping file so I know the textual
> > >>> keys that correspond to the numeric document ids (since my original
> > >>> document ids were file names and I created named vectors).
> > >>> Hope this helps.
> > >>> Dan
> > >>>
> > >>> ________________________________
> > >>>
> > >>>  From: Folcon <fo...@gmail.com>
> > >>> To: user@mahout.apache.org
> > >>> Sent: Saturday, July 28, 2012 8:28 PM
> > >>> Subject: Using Mahout to train an CVB and retrieve it's topics
> > >>>
> > >>> Hi Everyone,
> > >>>
> > >>> I'm posting this as my original message did not seem to appear on the
> > >>> mailing
> > >>> list, I'm very sorry if I have done this in error.
> > >>>
> > >>> I'm doing this to then use the topics to train a maxent algorithm to
> > >>> predict the
> > >>> classes of documents given their topic mixtures. Any further aid in
> > this
> > >>> direction would be appreciated!
> > >>>
> > >>> I've been trying to extract the topics out of my run of cvb. Here's
> > >>> what I did
> > >>> so far.
> > >>>
> > >>> Ok, so I still don't know how to output the topics, but I have
> worked
> > >>> out how to
> > >>> get the cvb and what I think are the document vectors, however I'm
> not
> > >>> having
> > >>> any luck dumping them, so help here would still be appreciated!
> > >>>
> > >>> I set the values of:
> > >>>    export MAHOUT_HOME=/home/sgeadmin/mahout
> > >>>    export HADOOP_HOME=/usr/lib/hadoop
> > >>>    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
> > >>>    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
> > >>> on the master otherwise none of this works.
> > >>>
> > >>> So first I uploaded the documents using starclusters put:
> > >>>    starcluster put mycluster text_train /home/sgeadmin/
> > >>>    starcluster put mycluster text_test /home/sgeadmin/
> > >>>
> > >>> Then I added them to hadoop's hbase filesystem:
> > >>>    dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop
> > >>> starcluster
> > >>>
> > >>> Then I called Mahout's seqdirectory to turn the text into sequence
> > files
> > >>>    $MAHOUT_HOME/bin/mahout seqdirectory --input
> > /user/sgeadmin/text_train
> > >>> --
> > >>> output /user/sgeadmin/text_seq -c UTF-8 -ow
> > >>>
> > >>> Then I called Mahout's seq2parse to turn them into vectors
> > >>>    $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin
> > >>> /text_vec -
> > >>> wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
> > >>>
> > >>> Finally I called cvb, I believe that the -dt flag states where the
> > >>> inferred
> > >>> topics should go, but because I haven't yet been able to dump them I
> > >>> can't
> > >>> confirm this.
> > >>>    $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors
> -o
> > >>> /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict
> > >>> /user/sgeadmin/text_vec/dictionary.file-0 -dt
> > /user/sgeadmin/text_cvb_document
> > >>> -
> > >>> mt /user/sgeadmin/text_states
> > >>>
> > >>> The -k flag is the number of topics, the -nt flag is the size of the
> > >>> dictionary,
> > >>> I computed this by counting the number of entries of the
> > >>> dictionary.file-0
> > >>> inside the vectors(in this case under /user/sgeadmin/text_vec) and
> -x
> > >>> is the
> > >>> number of iterations.
> > >>>
> > >>> If you know how to get what the document topic probabilities are from
> > >>> here, help
> > >>> would be most appreciated!
> > >>>
> > >>> Kind Regards,
> > >>> Folcon
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>>
> > >>>  -jake
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > >
> >
> >
> >
> >
>
>
>
>

Re: Using Mahout to train an CVB and retrieve it's topics

Posted by DAN HELM <da...@verizon.net>.
Hi Folcon,
 
There is no reason to rerun seq2sparse, as it is clear something is wrong with the text files being processed by the seqdirectory command.
 
Based on the keys, I'm assuming the full paths of the input files look like /high/59734, etc.  Did you look inside the files to make sure there is text in them?
 
As a test, just create a folder with a simple text file and run that through seqdirectory, and I'll bet you will then see output from the seqdumper command (run on the seqdirectory output).
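A minimal sketch of that sanity test (paths are illustrative; the mahout and hadoop invocations assume MAHOUT_HOME is set and a reachable cluster, so they are shown commented out):

```shell
# 1. Create a local folder containing one plain-text file.
mkdir -p /tmp/seqdir_sanity
echo "the quick brown fox jumps over the lazy dog" > /tmp/seqdir_sanity/doc1.txt

# 2. Copy it to HDFS and run seqdirectory on it (uncomment on the cluster).
# hadoop fs -put /tmp/seqdir_sanity /user/sgeadmin/seqdir_sanity
# $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/seqdir_sanity \
#     --output /user/sgeadmin/seqdir_sanity_seq -c UTF-8 -ow

# 3. Dump the result; a healthy run prints Key/Value pairs with non-empty values.
# $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/seqdir_sanity_seq/part-m-00000
```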
 
Thanks, Dan
 

________________________________
 From: Folcon Red <fo...@gmail.com>
To: DAN HELM <da...@verizon.net> 
Cc: "user@mahout.apache.org" <us...@mahout.apache.org> 
Sent: Tuesday, July 31, 2012 7:28 PM
Subject: Re: Using Mahout to train an CVB and retrieve it's topics
  
Hi Dan,

It's good to know that seqdirectory reads files in subfolders, and I've
dumped out some of the values in the hope that they will be
enlightening. The values seem to be missing from both text_seq and
the tokenized-documents.

So rerunning some of the commands:
$MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train
--output /user/sgeadmin/text_seq -c UTF-8 -ow
$MAHOUT_HOME/bin/mahout seq2sparse -i /user/sgeadmin/text_seq -o
/user/sgeadmin/text_vec -wt tf -a
org.apache.lucene.analysis.WhitespaceAnalyzer -ow

And then doing a seqdumper of text_seq:
SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
[...]
Key: /high/59734: Value:
Key: /high/264596: Value:
Key: /high/341699: Value:
Key: /high/260770: Value:
Key: /high/222320: Value:
Key: /high/198156: Value:
Key: /high/326011: Value:
Key: /high/112050: Value:
Key: /high/306887: Value:
Key: /high/208169: Value:
Key: /high/283464: Value:
Key: /high/168905: Value:
Count: 2548

root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout seqdumper -i
/user/sgeadmin/text_vec/tf-vectors/part-r-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
HADOOP_CONF_DIR=/conf
MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
12/07/31 23:23:34 INFO common.AbstractJob: Command line arguments:
{--endPhase=[2147483647],
--input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
--startPhase=[0], --tempDir=[temp]}
Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.math.VectorWritable
Count: 0

$MAHOUT_HOME/bin/mahout seqdumper -i
/user/sgeadmin/text_vec/tokenized-documents/part-m-00000
[...]
Key: /high/396063: Value: []
Key: /high/230246: Value: []
Key: /high/136284: Value: []
Key: /high/59734: Value: []
Key: /high/264596: Value: []
Key: /high/341699: Value: []
Key: /high/260770: Value: []
Key: /high/222320: Value: []
Key: /high/198156: Value: []
Key: /high/326011: Value: []
Key: /high/112050: Value: []
Key: /high/306887: Value: []
Key: /high/208169: Value: []
Key: /high/283464: Value: []
Key: /high/168905: Value: []
Count: 2548


Running vectordump on the text_vec folder like so:
root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout vectordump -i
/user/sgeadmin/text_vec
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
HADOOP_CONF_DIR=/conf
MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
12/07/31 23:21:08 INFO common.AbstractJob: Command line arguments:
{--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec],
--startPhase=[0], --tempDir=[temp]}
12/07/31 23:21:08 INFO vectors.VectorDumper: Sort? false
Exception in thread "main" java.lang.IllegalStateException:
file:/user/sgeadmin/text_vec/tf-vectors
at
org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:194)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: java.io.FileNotFoundException:
/user/sgeadmin/harry_old_mallet_vec/tf-vectors (Is a directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:137)
at
org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(RawLocalFileSystem.java:72)
at
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:108)
at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:178)
at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:127)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1452)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419)
at
org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
at
org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
... 15 more

Kind Regards,
Nilu

On 31 July 2012 23:59, DAN HELM <da...@verizon.net> wrote:

> Folcon,
>
> seqdirectory should also read files in subfolders.
>
> Did you verify that recent seqdirectory command did in fact generate
> non-empty sequence files?  I believe seqdirectory command just assumes
> each file contains a single document (no concatenated documents per
> file), and that each file contains basic text.
>
> If it did generate sequence files this time, I am assume your folder
> "/user/sgeadmin/text_seq" was copied to hdfs (if not already there) before
> you ran seq2sparse on it?
>
> Dan
>
>    *From:* Folcon Red <fo...@gmail.com>
> *To:* DAN HELM <da...@verizon.net>
> *Cc:* Jake Mannix <ja...@gmail.com>; "user@mahout.apache.org" <
> user@mahout.apache.org>
> *Sent:* Tuesday, July 31, 2012 1:34 PM
>
> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>
> So part-r-00000 inside text_vec is
> still SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable
> even after moving all the training files into a single folder.
>
> Regards,
> Folcon
>
> On 31 July 2012 18:18, Folcon Red <fo...@gmail.com> wrote:
>
> > Hey Everyone,
> >
> > Ok not certain why  $MAHOUT_HOME/bin/mahout seqdirectory --input /user/
> > sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow  didn't
>
> > produce sequence files, just looking inside text_seq only gives me:
> >
> > SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
> >
> > and that's it. Any ideas what I've been doing wrong? Maybe it's because I
> > have the files nested in the folder by class, for example a tree view of
> > the directory would look like.
> >
> > text_train -+
> >                | A -+
> >                        | 100
> >                        | 101
> >                        | 103
> >                | B -+
> >                        | 102
> >                        | 105
> >                        | 106
> >
> > So it's not picking them up? Or perhaps something else? I'm going to try
> > some variations to see what happens.
> >
> > Thanks for the help so far!
> >
> > Regards,
> > Folcon
> >
> >
> > On 29 July 2012 22:10, Folcon Red <fo...@gmail.com> wrote:
> >
> >> Right, well here's something promising, running $MAHOUT_HOME/bin/mahout
> >> seqdumper -i /user/sgeadmin/text_lda/part-m-00000 produced:
> >>
> >>
> >>
> 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN,29533:NaN,29534:NaN,29535:NaN}
> >>
> >> And $MAHOUT_HOME/bin/mahout seqdumper -i
> >> /user/sgeadmin/text_vec/tf-vectors/part-r-00000 produced:
> >>
> >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> >> HADOOP_CONF_DIR=/usr/lib/hadoop/conf
> >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> >> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments:
> >> {--endPhase=[2147483647],
> >> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
> >> --startPhase=[0], --tempDir=[temp]}
> >> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> >> Key class: class org.apache.hadoop.io.Text Value Class: class
> >> org.apache.mahout.math.VectorWritable
> >> Count: 0
> >>
> >> Kind Regards,
> >> Folcon
> >>
> >> On 29 July 2012 21:29, DAN HELM <da...@verizon.net> wrote:
> >>
> >>> Yep something went wrong, most likely with the clustering.  part file
> is
> >>> empty.  Should look something like this:
> >>>
> >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> >>> org.apache.mahout.math.VectorWritable
> >>> Key: 0: Value:
> >>>
> {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
> >>> Key: 1: Value:
> >>>
> {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
> >>> Key: 2: Value:
> >>>
> {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
> >>> ...
> >>> ...
> >>>
> >>> Key refers to a document id and the Value are topic ids:weights
> assigned
> >>> to document id.
> >>>
> >>> So you need to figure out where things went wrong.  I'm assume folder
> >>> /user/sgeadmin/text_lda also has empty part files?  Assuming parts
> >>> files are there run seqdumper on one.  Should have data like the above
> >>> except in this case the key will be a topic id and the vector will be
> term
> >>> ids:weights.
> >>>
> >>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to make
> >>> sure sparse vectors were generated for your input to cvb.
> >>>
> >>> Dan
> >>>
> >>>    *From:* Folcon Red <fo...@gmail.com>
> >>> *To:* DAN HELM <da...@verizon.net>
> >>> *Cc:* Jake Mannix <ja...@gmail.com>; "user@mahout.apache.org" <
> >>> user@mahout.apache.org>
> >>> *Sent:* Sunday, July 29, 2012 3:35 PM
> >>>
> >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>
> >>>
> >>> Thanks Dan and Jake,
> >>>
> >>> The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i /user/
> >>> sgeadmin/text_cvb_document/part-m-00000 is:
> >>>
> >>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
> >>> Key class: class org.apache.hadoop.io <
> http://org.apache.hadoop.io.int/>
>
> >>> .IntWritable Value Class: class org.apache.mahout.math.VectorWritable
> >>> Count: 0
> >>>
> >>> I'm not certain what went wrong.
> >>>
> >>> Kind Regards,
> >>> Folcon
> >>>
> >>> On 29 July 2012 18:49, DAN HELM <da...@verizon.net> wrote:
> >>>
> >>> Folcon,
> >>>
> >>> I'm still using Mahout 0.6 so don't know much about changes in 0.7.
> >>>
> >>> Your output folder for "dt" looks correct.  The relevant data would be
> >>> in  /user/sgeadmin/text_cvb_document/part-m-00000 which is what I would
> >>> be passing to a "-s" option.  But I see it says size is only 97 so that
> >>> looks suspicious.  So you can just view file (for starters) as: mahout
> >>> seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000.  And the
> >>> vector dumper command (as Jake pointed out) has a lot more options to
> >>> post-process the data but you may want to first just see what is in
> >>> that file.
> >>>
> >>> Dan
> >>>
> >>>    *From:* Folcon Red <fo...@gmail.com>
> >>> *To:* Jake Mannix <ja...@gmail.com>
> >>> *Cc:* user@mahout.apache.org; DAN HELM <da...@verizon.net>
> >>> *Sent:* Sunday, July 29, 2012 1:08 PM
> >>> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>
> >>>
> >>> Hi Guys,
> >>>
> >>> Thanks for replying, the problem is whenever I use any -s flag I get
> the
> >>> error "Unexpected -s while processing Job-Specific Options:"
> >>>
> >>> Also I'm not sure if this is supposed to be the output of -dt
> >>>
> >>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop
> >>> starcluster
> >>> Found 3 items
> >>> -rw-r--r--  3 sgeadmin supergroup          0 2012-07-29 16:51 /user/
> >>> sgeadmin/text_cvb_document/_SUCCESS
> >>> drwxr-xr-x  - sgeadmin supergroup          0 2012-07-29 16:50 /user/
> >>> sgeadmin/text_cvb_document/_logs
> >>> -rw-r--r--  3 sgeadmin supergroup        97 2012-07-29 16:51 /user/
> >>> sgeadmin/text_cvb_document/part-m-00000
> >>>
> >>> Should I be using a newer version of mahout? I've just been using the
> >>> 0.7 distribution so far as apparently the compiled versions are missing
> >>> parts that the distributed ones have.
> >>>
> >>> Kind Regards,
> >>> Folcon
> >>>
> >>> PS: Thanks for the help so far!
> >>>
> >>> On 29 July 2012 04:52, Jake Mannix <ja...@gmail.com> wrote:
> >>>
> >>>
> >>>
> >>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <danielhelm@verizon.net
> >wrote:
> >>>
> >>> Hi Folcon,
> >>>
> >>> In the folder you specified for the –dt option for cvb command
> >>> there should be sequence files with the document to topic associations
> >>> (Key:
> >>> IntWritable, Value: VectorWritable).
> >>>
> >>>
> >>> Yeah, this is correct, although this:
> >>>
> >>>
> >>> You can dump in text format as: mahout seqdumper –s <sequence file>
> >>>
> >>>
> >>> is not as good as using vectordumper:
> >>>
> >>>    mahout vectordump -s <sequence file> --dictionary <path to
> dictionary.file-0>
> >>> \
> >>>        --dictionaryType seqfile --vectorSize <num entries per topic you
> >>> want to see> -sort
> >>>
> >>> This joins your topic vectors with the dictionary, then picks out the
> >>> top k terms (with their
> >>> probabilities) for each topic and prints them to the console (or to the
> >>> file you specify with
> >>> an --output option).
> >>>
> >>> *although* I notice now that in trunk when I just checked,
> VectorDumper.java
> >>> had a bug
> >>> in it for "vectorSize" - line 175 asks for cmdline option "
> >>> numIndexesPerVector" not
> >>> vectorSize, ack!  So I took the liberty of fixing that, but you'll need
> >>> to "svn up" and rebuild
> >>> your jar before using vectordump like this.
> >>>
> >>>
> >>>  So in text output from seqdumper, the key is a document id and the
> >>> vector contains
> >>> the topics and associated scores associated with the document.  I think
> >>> all topics are listed for each
> >>> document but many with near zero score.
> >>> In my case I used rowid to convert keys of original sparse
> >>> document vectors from Text to Integer before running cvb and this
> >>> generates a mapping file so I know the textual
> >>> keys that correspond to the numeric document ids (since my original
> >>> document ids were file names and I created named vectors).
> >>> Hope this helps.
> >>> Dan
> >>>
> >>> ________________________________
> >>>
> >>>  From: Folcon <fo...@gmail.com>
> >>> To: user@mahout.apache.org
> >>> Sent: Saturday, July 28, 2012 8:28 PM
> >>> Subject: Using Mahout to train an CVB and retrieve it's topics
> >>>
> >>> Hi Everyone,
> >>>
> >>> I'm posting this as my original message did not seem to appear on the
> >>> mailing
> >>> list, I'm very sorry if I have done this in error.
> >>>
> >>> I'm doing this to then use the topics to train a maxent algorithm to
> >>> predict the
> >>> classes of documents given their topic mixtures. Any further aid in
> this
> >>> direction would be appreciated!
> >>>
> >>> I've been trying to extract the topics out of my run of cvb. Here's
> >>> what I did
> >>> so far.
> >>>
> >>> Ok, so I still don't know how to output the topics, but I have worked
> >>> out how to
> >>> get the cvb and what I think are the document vectors, however I'm not
> >>> having
> >>> any luck dumping them, so help here would still be appreciated!
> >>>
> >>> I set the values of:
> >>>    export MAHOUT_HOME=/home/sgeadmin/mahout
> >>>    export HADOOP_HOME=/usr/lib/hadoop
> >>>    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
> >>>    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
> >>> on the master otherwise none of this works.
> >>>
> >>> So first I uploaded the documents using starclusters put:
> >>>    starcluster put mycluster text_train /home/sgeadmin/
> >>>    starcluster put mycluster text_test /home/sgeadmin/
> >>>
> >>> Then I added them to hadoop's hbase filesystem:
> >>>    dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop
> >>> starcluster
> >>>
> >>> Then I called Mahout's seqdirectory to turn the text into sequence
> files
> >>>    $MAHOUT_HOME/bin/mahout seqdirectory --input
> /user/sgeadmin/text_train
> >>> --
> >>> output /user/sgeadmin/text_seq -c UTF-8 -ow
> >>>
> >>> Then I called Mahout's seq2parse to turn them into vectors
> >>>    $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin
> >>> /text_vec -
> >>> wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
> >>>
> >>> Finally I called cvb, I believe that the -dt flag states where the
> >>> inferred
> >>> topics should go, but because I haven't yet been able to dump them I
> >>> can't
> >>> confirm this.
> >>>    $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors -o
> >>> /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict
> >>> /user/sgeadmin/text_vec/dictionary.file-0 -dt
> /user/sgeadmin/text_cvb_document
> >>> -
> >>> mt /user/sgeadmin/text_states
> >>>
> >>> The -k flag is the number of topics, the -nt flag is the size of the
> >>> dictionary,
> >>> I computed this by counting the number of entries of the
> >>> dictionary.file-0
> >>> inside the vectors(in this case under /user/sgeadmin/text_vec) and -x
> >>> is the
> >>> number of iterations.
> >>>
> >>> If you know how to get what the document topic probabilities are from
> >>> here, help
> >>> would be most appreciated!
> >>>
> >>> Kind Regards,
> >>> Folcon
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>>
> >>>  -jake
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >
>
>
>
>

Re: Using Mahout to train an CVB and retrieve it's topics

Posted by Folcon Red <fo...@gmail.com>.
Hi Dan,

It's good to know that seqdirectory reads files in subfolders, and I've
dumped out some of the values in the hope that they will be
enlightening. The values seem to be missing from both text_seq and
the tokenized-documents.

So rerunning some of the commands:
$MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train
--output /user/sgeadmin/text_seq -c UTF-8 -ow
$MAHOUT_HOME/bin/mahout seq2sparse -i /user/sgeadmin/text_seq -o
/user/sgeadmin/text_vec -wt tf -a
org.apache.lucene.analysis.WhitespaceAnalyzer -ow

And then doing a seqdumper of text_seq:
SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
[...]
Key: /high/59734: Value:
Key: /high/264596: Value:
Key: /high/341699: Value:
Key: /high/260770: Value:
Key: /high/222320: Value:
Key: /high/198156: Value:
Key: /high/326011: Value:
Key: /high/112050: Value:
Key: /high/306887: Value:
Key: /high/208169: Value:
Key: /high/283464: Value:
Key: /high/168905: Value:
Count: 2548

root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout seqdumper -i
/user/sgeadmin/text_vec/tf-vectors/part-r-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
HADOOP_CONF_DIR=/conf
MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
12/07/31 23:23:34 INFO common.AbstractJob: Command line arguments:
{--endPhase=[2147483647],
--input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
--startPhase=[0], --tempDir=[temp]}
Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.math.VectorWritable
Count: 0

$MAHOUT_HOME/bin/mahout seqdumper -i
/user/sgeadmin/text_vec/tokenized-documents/part-m-00000
[...]
Key: /high/396063: Value: []
Key: /high/230246: Value: []
Key: /high/136284: Value: []
Key: /high/59734: Value: []
Key: /high/264596: Value: []
Key: /high/341699: Value: []
Key: /high/260770: Value: []
Key: /high/222320: Value: []
Key: /high/198156: Value: []
Key: /high/326011: Value: []
Key: /high/112050: Value: []
Key: /high/306887: Value: []
Key: /high/208169: Value: []
Key: /high/283464: Value: []
Key: /high/168905: Value: []
Count: 2548
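One quick way to rule out empty inputs before blaming seqdirectory is to scan the corpus for zero-byte files. A hedged sketch on a throwaway demo directory (in practice, point TRAIN_DIR at the real corpus, e.g. /home/sgeadmin/text_train, and check the HDFS copy too with `hadoop fs -ls -R`):

```shell
# Demo corpus mimicking the /high/<id> layout from this thread.
TRAIN_DIR=/tmp/text_train_demo
mkdir -p "$TRAIN_DIR/high"
printf 'some document text\n' > "$TRAIN_DIR/high/59734"   # healthy file
: > "$TRAIN_DIR/high/59999"                               # deliberately empty

# Any file listed here would yield an empty Value in the sequence file.
find "$TRAIN_DIR" -type f -empty
```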


Running vectordump on the text_vec folder like so:
root@master:/home/sgeadmin/corpora# $MAHOUT_HOME/bin/mahout vectordump -i
/user/sgeadmin/text_vec
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
HADOOP_CONF_DIR=/conf
MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
12/07/31 23:21:08 INFO common.AbstractJob: Command line arguments:
{--endPhase=[2147483647], --input=[/user/sgeadmin/text_vec],
--startPhase=[0], --tempDir=[temp]}
12/07/31 23:21:08 INFO vectors.VectorDumper: Sort? false
Exception in thread "main" java.lang.IllegalStateException:
file:/user/sgeadmin/text_vec/tf-vectors
at
org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:194)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: java.io.FileNotFoundException:
/user/sgeadmin/harry_old_mallet_vec/tf-vectors (Is a directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:137)
at
org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(RawLocalFileSystem.java:72)
at
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:108)
at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:178)
at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:127)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1452)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419)
at
org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
at
org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
... 15 more
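For what it's worth, the "Is a directory" failure above comes from pointing vectordump at the top-level text_vec folder. A sketch of the intended invocation against a part file instead, using the seqfile dictionary type mentioned earlier in the thread (paths are illustrative; the mahout call itself is only echoed here):

```shell
# Target a part file (or the tf-vectors directory), never the parent output folder.
VEC=/user/sgeadmin/text_vec/tf-vectors/part-r-00000
DICT=/user/sgeadmin/text_vec/dictionary.file-0

# Print the command to run on the cluster.
echo "$MAHOUT_HOME/bin/mahout vectordump -i $VEC --dictionary $DICT --dictionaryType seqfile"
```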

Kind Regards,
Nilu

On 31 July 2012 23:59, DAN HELM <da...@verizon.net> wrote:

> Folcon,
>
> seqdirectory should also read files in subfolders.
>
> Did you verify that recent seqdirectory command did in fact generate
> non-empty sequence files?  I believe seqdirectory command just assumes
> each file contains a single document (no concatenated documents per
> file), and that each file contains basic text.
>
> If it did generate sequence files this time, I am assume your folder
> "/user/sgeadmin/text_seq" was copied to hdfs (if not already there) before
> you ran seq2sparse on it?
>
> Dan
>
>    *From:* Folcon Red <fo...@gmail.com>
> *To:* DAN HELM <da...@verizon.net>
> *Cc:* Jake Mannix <ja...@gmail.com>; "user@mahout.apache.org" <
> user@mahout.apache.org>
> *Sent:* Tuesday, July 31, 2012 1:34 PM
>
> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>
> So part-r-00000 inside text_vec is
> still SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable
> even after moving all the training files into a single folder.
>
> Regards,
> Folcon
>
> On 31 July 2012 18:18, Folcon Red <fo...@gmail.com> wrote:
>
> > Hey Everyone,
> >
> > Ok not certain why  $MAHOUT_HOME/bin/mahout seqdirectory --input /user/
> > sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow  didn't
>
> > produce sequence files, just looking inside text_seq only gives me:
> >
> > SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text
> >
> > and that's it. Any ideas what I've been doing wrong? Maybe it's because I
> > have the files nested in the folder by class, for example a tree view of
> > the directory would look like.
> >
> > text_train -+
> >                | A -+
> >                        | 100
> >                        | 101
> >                        | 103
> >                | B -+
> >                        | 102
> >                        | 105
> >                        | 106
> >
> > So it's not picking them up? Or perhaps something else? I'm going to try
> > some variations to see what happens.
> >
> > Thanks for the help so far!
> >
> > Regards,
> > Folcon
> >
> >
> > On 29 July 2012 22:10, Folcon Red <fo...@gmail.com> wrote:
> >
> >> Right, well here's something promising, running $MAHOUT_HOME/bin/mahout
> >> seqdumper -i /user/sgeadmin/text_lda/part-m-00000 produced:
> >>
> >>
> >>
> 7:NaN,29478:NaN,29479:NaN,29480:NaN,29481:NaN,29482:NaN,29483:NaN,29484:NaN,29485:NaN,29486:NaN,29487:NaN,29488:NaN,29489:NaN,29490:NaN,29491:NaN,29492:NaN,29493:NaN,29494:NaN,29495:NaN,29496:NaN,29497:NaN,29498:NaN,29499:NaN,29500:NaN,29501:NaN,29502:NaN,29503:NaN,29504:NaN,29505:NaN,29506:NaN,29507:NaN,29508:NaN,29509:NaN,29510:NaN,29511:NaN,29512:NaN,29513:NaN,29514:NaN,29515:NaN,29516:NaN,29517:NaN,29518:NaN,29519:NaN,29520:NaN,29521:NaN,29522:NaN,29523:NaN,29524:NaN,29525:NaN,29526:NaN,29527:NaN,29528:NaN,29529:NaN,29530:NaN,29531:NaN,29532:NaN,29533:NaN,29534:NaN,29535:NaN}
> >>
> >> And $MAHOUT_HOME/bin/mahout seqdumper -i
> >> /user/sgeadmin/text_vec/tf-vectors/part-r-00000 produced:
> >>
> >> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> >> Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
> >> HADOOP_CONF_DIR=/usr/lib/hadoop/conf
> >> MAHOUT-JOB: /home/sgeadmin/mahout/mahout-examples-0.7-job.jar
> >> 12/07/29 21:09:17 INFO common.AbstractJob: Command line arguments:
> >> {--endPhase=[2147483647],
> >> --input=[/user/sgeadmin/text_vec/tf-vectors/part-r-00000],
> >> --startPhase=[0], --tempDir=[temp]}
> >> Input Path: /user/sgeadmin/text_vec/tf-vectors/part-r-00000
> >> Key class: class org.apache.hadoop.io.Text Value Class: class
> >> org.apache.mahout.math.VectorWritable
> >> Count: 0
> >>
> >> Kind Regards,
> >> Folcon
> >>
> >> On 29 July 2012 21:29, DAN HELM <da...@verizon.net> wrote:
> >>
> >>> Yep, something went wrong, most likely with the clustering.  The part
> >>> file is empty.  It should look something like this:
> >>>
> >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> >>> org.apache.mahout.math.VectorWritable
> >>> Key: 0: Value:
> >>>
> {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
> >>> Key: 1: Value:
> >>>
> {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
> >>> Key: 2: Value:
> >>>
> {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
> >>> ...
> >>> ...
> >>>
> >>> The Key refers to a document id and the Value holds the topic
> >>> ids:weights assigned to that document.
> >>>
> >>> So you need to figure out where things went wrong.  I assume folder
> >>> /user/sgeadmin/text_lda also has empty part files?  Assuming part
> >>> files are there, run seqdumper on one.  It should have data like the
> >>> above, except in this case the key will be a topic id and the vector
> >>> will be term ids:weights.
> >>>
> >>> You can also check folder /user/sgeadmin/text_vec/tf-vectors to make
> >>> sure sparse vectors were generated for your input to cvb.
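As a sketch, using the paths from this thread, the stages can be checked in order to find the first empty one (the -c/--count flag is assumed to be available in seqdumper):

```shell
# Sketch: walk the pipeline backwards; the first stage with Count: 0
# (or a record count that doesn't match the corpus size) is the culprit.
$MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_cvb_document/part-m-00000 -c
$MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_lda/part-m-00000 -c
$MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/tf-vectors/part-r-00000 -c
$MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_seq/part-m-00000 -c
```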
> >>>
> >>> Dan
> >>>
> >>>    From: Folcon Red <fo...@gmail.com>
> >>> To: DAN HELM <da...@verizon.net>
> >>> Cc: Jake Mannix <ja...@gmail.com>; "user@mahout.apache.org" <
> >>> user@mahout.apache.org>
> >>> Sent: Sunday, July 29, 2012 3:35 PM
> >>>
> >>> Subject: Re: Using Mahout to train an CVB and retrieve it's topics
>
> >>>
> >>> Thanks Dan and Jake,
> >>>
> >>> The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i /user/
> >>> sgeadmin/text_cvb_document/part-m-00000 is:
> >>>
> >>> Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
> >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> >>> org.apache.mahout.math.VectorWritable
> >>> Count: 0
> >>>
> >>> I'm not certain what went wrong.
> >>>
> >>> Kind Regards,
> >>> Folcon
> >>>
> >>> On 29 July 2012 18:49, DAN HELM <da...@verizon.net> wrote:
> >>>
> >>> Folcon,
> >>>
> >>> I'm still using Mahout 0.6 so don't know much about changes in 0.7.
> >>>
> >>> Your output folder for "dt" looks correct.  The relevant data would be
> >>> in /user/sgeadmin/text_cvb_document/part-m-00000, which is what I would
> >>> be passing to a "-s" option.  But I see it says the size is only 97, so
> >>> that looks suspicious.  You can just view the file (for starters) as:
> >>> mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000.  The
> >>> vector dumper command (as Jake pointed out) has a lot more options to
> >>> post-process the data, but you may want to first just see what is in
> >>> that file.
> >>>
> >>> Dan
> >>>
> >>>    From: Folcon Red <fo...@gmail.com>
> >>> To: Jake Mannix <ja...@gmail.com>
> >>> Cc: user@mahout.apache.org; DAN HELM <da...@verizon.net>
> >>> Sent: Sunday, July 29, 2012 1:08 PM
> >>> Subject: Re: Using Mahout to train an CVB and retrieve it's topics
>
> >>>
> >>> Hi Guys,
> >>>
> >>> Thanks for replying, the problem is whenever I use any -s flag I get
> >>> the error "Unexpected -s while processing Job-Specific Options:"
> >>>
> >>> Also I'm not sure if this is supposed to be the output of -dt
> >>>
> >>> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop
> >>> starcluster
> >>> Found 3 items
> >>> -rw-r--r--  3 sgeadmin supergroup          0 2012-07-29 16:51 /user/
> >>> sgeadmin/text_cvb_document/_SUCCESS
> >>> drwxr-xr-x  - sgeadmin supergroup          0 2012-07-29 16:50 /user/
> >>> sgeadmin/text_cvb_document/_logs
> >>> -rw-r--r--  3 sgeadmin supergroup        97 2012-07-29 16:51 /user/
> >>> sgeadmin/text_cvb_document/part-m-00000
> >>>
> >>> Should I be using a newer version of mahout? I've just been using the
> >>> 0.7 distribution so far, as apparently self-compiled versions are
> >>> missing parts that the distributed ones have.
> >>>
> >>> Kind Regards,
> >>> Folcon
> >>>
> >>> PS: Thanks for the help so far!
> >>>
> >>> On 29 July 2012 04:52, Jake Mannix <ja...@gmail.com> wrote:
> >>>
> >>>
> >>>
> >>> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <danielhelm@verizon.net>
> >>> wrote:
> >>>
> >>> Hi Folcon,
> >>>
> >>> In the folder you specified for the -dt option for the cvb command
> >>> there should be sequence files with the document-to-topic associations
> >>> (Key: IntWritable, Value: VectorWritable).
> >>>
> >>>
> >>> Yeah, this is correct, although this:
> >>>
> >>>
> >>> You can dump in text format as: mahout seqdumper -s <sequence file>
> >>>
> >>>
> >>> is not as good as using vectordumper:
> >>>
> >>>    mahout vectordump -s <sequence file> \
> >>>        --dictionary <path to dictionary.file-0> --dictionaryType seqfile \
> >>>        --vectorSize <num entries per topic you want to see> -sort
> >>>
> >>> This joins your topic vectors with the dictionary, then picks out the
> >>> top k terms (with their
> >>> probabilities) for each topic and prints them to the console (or to the
> >>> file you specify with
> >>> an --output option).
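For example, filled in with the paths from this thread (a sketch only; flag names are taken from the discussion above and not verified against a particular Mahout build):

```shell
# Sketch: dump the top 10 terms per topic from the cvb topic-term output,
# joined against the dictionary produced by seq2sparse.
$MAHOUT_HOME/bin/mahout vectordump -s /user/sgeadmin/text_lda/part-m-00000 \
    --dictionary /user/sgeadmin/text_vec/dictionary.file-0 \
    --dictionaryType seqfile --vectorSize 10 -sort \
    --output /tmp/topic_terms.txt   # omit --output to print to the console
```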
> >>>
> >>> *although* I notice now that in trunk, when I just checked,
> >>> VectorDumper.java had a bug in it for "vectorSize" - line 175 asks for
> >>> the cmdline option "numIndexesPerVector", not vectorSize, ack!  So I
> >>> took the liberty of fixing that, but you'll need to "svn up" and
> >>> rebuild your jar before using vectordump like this.
> >>>
> >>>
> >>> So in the text output from seqdumper, the key is a document id and the
> >>> vector contains the topics and their scores for that document.  I think
> >>> all topics are listed for each document, but many with a near-zero
> >>> score.  In my case I used rowid to convert the keys of the original
> >>> sparse document vectors from Text to Integer before running cvb; this
> >>> also generates a mapping file, so I know the textual keys that
> >>> correspond to the numeric document ids (since my original document ids
> >>> were file names and I created named vectors).
> >>> Hope this helps.
> >>> Dan
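A sketch of that rowid step, using the tf-vectors path from this thread (the matrix/docIndex output names follow Mahout's RowIdJob and may differ by version):

```shell
# Sketch: rowid rewrites Text keys to sequential IntWritable keys and
# writes a docIndex file mapping the ints back to the original keys.
$MAHOUT_HOME/bin/mahout rowid \
    -i /user/sgeadmin/text_vec/tf-vectors \
    -o /user/sgeadmin/text_vec/tf-vectors-int
# tf-vectors-int/matrix   : IntWritable -> VectorWritable (feed this to cvb)
# tf-vectors-int/docIndex : IntWritable -> Text (doc id to file name mapping)
```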
> >>>
> >>> ________________________________
> >>>
> >>>  From: Folcon <fo...@gmail.com>
> >>> To: user@mahout.apache.org
> >>> Sent: Saturday, July 28, 2012 8:28 PM
> >>> Subject: Using Mahout to train an CVB and retrieve it's topics
> >>>
> >>> Hi Everyone,
> >>>
> >>> I'm posting this as my original message did not seem to appear on the
> >>> mailing
> >>> list, I'm very sorry if I have done this in error.
> >>>
> >>> I'm doing this so I can use the topics to train a maxent algorithm to
> >>> predict the classes of documents given their topic mixtures. Any
> >>> further aid in this direction would be appreciated!
> >>>
> >>> I've been trying to extract the topics out of my run of cvb. Here's
> >>> what I've done so far.
> >>>
> >>> Ok, so I still don't know how to output the topics, but I have worked
> >>> out how to run cvb and produce what I think are the document vectors;
> >>> however, I'm not having any luck dumping them, so help here would
> >>> still be appreciated!
> >>>
> >>> I set the values of:
> >>>    export MAHOUT_HOME=/home/sgeadmin/mahout
> >>>    export HADOOP_HOME=/usr/lib/hadoop
> >>>    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
> >>>    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
> >>> on the master otherwise none of this works.
> >>>
> >>> So first I uploaded the documents using starclusters put:
> >>>    starcluster put mycluster text_train /home/sgeadmin/
> >>>    starcluster put mycluster text_test /home/sgeadmin/
> >>>
> >>> Then I added them to Hadoop's HDFS filesystem:
> >>>    dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop
> >>> starcluster
> >>>
> >>> Then I called Mahout's seqdirectory to turn the text into sequence
> >>> files:
> >>>    $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train \
> >>>        --output /user/sgeadmin/text_seq -c UTF-8 -ow
> >>>
> >>> Then I called Mahout's seq2sparse to turn them into vectors:
> >>>    $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec \
> >>>        -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
> >>>
> >>> Finally I called cvb. I believe that the -dt flag states where the
> >>> inferred topics should go, but because I haven't yet been able to
> >>> dump them I can't confirm this.
> >>>    $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors \
> >>>        -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 \
> >>>        -dict /user/sgeadmin/text_vec/dictionary.file-0 \
> >>>        -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states
> >>>
> >>> The -k flag is the number of topics and -x is the number of
> >>> iterations. The -nt flag is the size of the dictionary; I computed
> >>> this by counting the number of entries in dictionary.file-0 inside
> >>> the vectors output (in this case under /user/sgeadmin/text_vec).
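As a sketch, the dictionary size for -nt can also be obtained with seqdumper's count option (assuming the -c/--count flag exists in your build) rather than counting entries by hand:

```shell
# Sketch: report only the number of records (dictionary entries) in the
# dictionary sequence file; this value is what -nt expects.
$MAHOUT_HOME/bin/mahout seqdumper \
    -i /user/sgeadmin/text_vec/dictionary.file-0 -c
```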
> >>>
> >>> If you know how to get the document topic probabilities from here,
> >>> help would be most appreciated!
> >>>
> >>> Kind Regards,
> >>> Folcon
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>>
> >>>  -jake
> >>>