Posted to user@mahout.apache.org by wine lover <wi...@gmail.com> on 2011/07/07 19:53:10 UTC

how to transfer the sequence file into readable format

Dear All,

After running LDA analysis, I got the docTopic file, which is a regular
sequence file. How can I convert it into a readable format? I searched for
vectordumper (or vectordump) but did not find anything useful, such as how
to use it from the command line. Thanks.

RE: how to transfer the sequence file into readable format

Posted by Jeff Eastman <je...@Narus.com>.
Haha, well ok, so maybe Dhruv will be motivated to submit a patch to add it exactly the way he wants to see it. The ClusterDumper has this as an option, since there are generally a lot more vectors than clusters. It also can write this output to a file or to the transcript, IIRC. What if they had similar CLI arguments?

-----Original Message-----
From: Jake Mannix [mailto:jake.mannix@gmail.com] 
Sent: Thursday, July 07, 2011 2:32 PM
To: user@mahout.apache.org
Subject: Re: how to transfer the sequence file into readable format

Does LDAPrintTopics print the *document*-topic probabilities, or just
the *term*-topic probabilities?  I thought only the latter, because I was
too lazy (sorry!) to update it to add in the ability to put the former as
well when I added docTopics to the LDA output.


Re: how to transfer the sequence file into readable format

Posted by Jake Mannix <ja...@gmail.com>.
Does LDAPrintTopics print the *document*-topic probabilities, or just
the *term*-topic probabilities?  I thought only the latter, because I was
too lazy (sorry!) to update it to add in the ability to put the former as
well when I added docTopics to the LDA output.

On Thu, Jul 7, 2011 at 8:24 PM, Jeff Eastman <je...@narus.com> wrote:

> I think you want LDAPrintTopics?
>

RE: how to transfer the sequence file into readable format

Posted by Jeff Eastman <je...@Narus.com>.
I think you want LDAPrintTopics?

-----Original Message-----
From: dhruv21@gmail.com [mailto:dhruv21@gmail.com] On Behalf Of Dhruv Kumar
Sent: Thursday, July 07, 2011 11:29 AM
To: user@mahout.apache.org
Subject: Re: how to transfer the sequence file into readable format

Sequence Files store key and value pairs in a binary, compressed format. To
read a sequence file and display the key and values in a human format, you
can use SequenceFile Reader:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.Reader.html

I don't know the outputs of LDA, but in general you can do the following,
assuming key is IntWritable and value is DoubleWritable.

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(fs,
    new Path("/path/to/output/of/LDA"), conf);
IntWritable key = new IntWritable();
DoubleWritable value = new DoubleWritable();

while (reader.next(key, value)) {
  System.out.println(key + "\t" + value);
}
reader.close();


There may be a convenient command line utility for LDA also which someone
else can point out. However, you can always write your own simple class as
shown above for reading any Sequence File.






Re: how to transfer the sequence file into readable format

Posted by Dhruv Kumar <dk...@ecs.umass.edu>.
Sequence Files store key and value pairs in a binary, optionally compressed
format. To read a sequence file and display the keys and values in a
human-readable format, you can use SequenceFile.Reader:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.Reader.html

I don't know the outputs of LDA, but in general you can do the following,
assuming key is IntWritable and value is DoubleWritable.

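// Hadoop classes used below.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(fs,
    new Path("/path/to/output/of/LDA"), conf);
// Reusable holders; their types must match the file's key/value classes.
IntWritable key = new IntWritable();
DoubleWritable value = new DoubleWritable();

while (reader.next(key, value)) {
  System.out.println(key + "\t" + value);
}
reader.close();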


There may be a convenient command line utility for LDA also which someone
else can point out. However, you can always write your own simple class as
shown above for reading any Sequence File.





On Thu, Jul 7, 2011 at 1:53 PM, wine lover <wi...@gmail.com> wrote:

> Dear All,
>
> After running LDA analysis, I got the docTopic file, which is a regular
> sequence-file. How to transfer it into a readable format? I searched
> vectordumper, or vectordump, but did not get any useful results, such as
> how to use it in command-line? Thanks.
>

Re: how to transfer the sequence file into readable format

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, Jul 11, 2011 at 8:15 AM, Dhruv Kumar <dk...@ecs.umass.edu> wrote:

> On Fri, Jul 8, 2011 at 11:05 AM, Jake Mannix <ja...@gmail.com> wrote:
>
> > At the end of the exception trace, you should see the list of options
> > which it will take.  As I said, it's missing a "--help" option, but all
> > of the mahout programs, if given an incorrect argument, will give this
> > stack trace, followed by the list of arguments you *could* use.
> >
>
> Seems to violate the principle of least astonishment.
>

Of course it does, which is why I said that it was a bug in that particular
script.


> If this is a systemic issue with all the command line scripts, I think we
> should create a JIRA issue for it. I can work on it on the side with my
> GSOC project.
>

It is specific to seqdumper and vectordumper.  All other actions in the
script do the right thing, that I know of.


> Why does this happen in the first place?
>

I think it's a really simple, easy-to-fix issue: VectorDumper.java has a
line in main():

Group group = gbuilder.withName("Options").withOption(seqOpt).withOption(outputOpt)
  .withOption(dictTypeOpt).withOption(dictOpt).withOption(csvOpt).withOption(vectorAsKeyOpt)
  .withOption(printKeyOpt).withOption(sizeOpt).create();

but it does not have a "withOption(helpOpt)" which was defined above, and so
it never checks for this option when parsing.  Adding this line should make
--help do the right thing.
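
The mechanism described above can be sketched without Hadoop or Mahout on the classpath. This is only an illustration under stated assumptions: `HelpOptionSketch`, `group`, and `parse` are hypothetical names for a tiny stand-in parser, not the commons-cli2 API that VectorDumper actually uses. It shows why a flag that is defined but never added to the option group is rejected as "unexpected", and why registering it makes the same command line succeed.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical mini-parser for illustration; NOT the commons-cli2 API.
class HelpOptionSketch {

    /** Collects the long options a tool is willing to accept. */
    static Set<String> group(String... names) {
        Set<String> known = new LinkedHashSet<>();
        for (String name : names) {
            known.add(name);
        }
        return known;
    }

    /** Rejects any "--flag" that was never registered in the group. */
    static String parse(Set<String> known, String[] args) {
        boolean helpRequested = false;
        for (String arg : args) {
            if (!arg.startsWith("--")) {
                continue; // option values such as paths are not validated here
            }
            if (!known.contains(arg)) {
                throw new IllegalArgumentException(
                    "Unexpected " + arg + " while processing Options");
            }
            if (arg.equals("--help")) {
                helpRequested = true;
            }
        }
        return helpRequested ? "usage" : "run";
    }

    public static void main(String[] args) {
        String[] cli = {"--help"};
        // Group built without the help option, as in the line quoted above.
        Set<String> withoutHelp = group("--seqFile", "--output");
        // Same group with the help option registered.
        Set<String> withHelp = group("--seqFile", "--output", "--help");
        try {
            parse(withoutHelp, cli);
        } catch (IllegalArgumentException e) {
            System.out.println("without helpOpt: " + e.getMessage());
        }
        System.out.println("with helpOpt: " + parse(withHelp, cli));
    }
}
```

In the real tool the equivalent change would presumably be the one-line addition of withOption(helpOpt) to the builder chain, as described in the message.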

  -jake



Re: how to transfer the sequence file into readable format

Posted by Dhruv Kumar <dk...@ecs.umass.edu>.
On Fri, Jul 8, 2011 at 11:05 AM, Jake Mannix <ja...@gmail.com> wrote:

> At the end of the exception trace, you should see the list of options
> which it will take.  As I said, it's missing a "--help" option, but all
> of the mahout programs, if given an incorrect argument, will give this
> stack trace, followed by the list of arguments you *could* use.
>

Seems to violate the principle of least astonishment.

If this is a systemic issue with all the command line scripts, I think we
should create a JIRA issue for it. I can work on it on the side with my GSOC
project.

Why does this happen in the first place?



Re: how to transfer the sequence file into readable format

Posted by Jake Mannix <ja...@gmail.com>.
At the end of the exception trace, you should see the list of options which
it will take.  As I said, it's missing a "--help" option, but all of the
mahout programs, if given an incorrect argument, will give this stack trace,
followed by the list of arguments you *could* use.

In this case, they're printed below, I'll cut the part out you need:

---------------
Usage:

 [--seqFile <seqFile> --output <output> --dictionaryType <dictionaryType>
 --dictionary <dictionary> --csv --useKey --printKey --sizeOnly]

Options

  --seqFile (-s) seqFile                   The Sequence File containing the
                                           Vectors
  --output (-o) output                     The output file.  If not specified,
                                           dumps to the console
  --dictionaryType (-dt) dictionaryType    The dictionary file type
                                           (text|sequencefile)
  --dictionary (-d) dictionary             The dictionary file.
  --csv (-c)                               Output the Vector as CSV.  Otherwise
                                           it substitutes in the terms for
                                           vector cell entries
  --useKey (-u)                            If the Key is a vector, then dump
                                           that instead
  --printKey (-p)                          Print out the key as well, delimited
                                           by a tab (or the value if useKey is
                                           true)
  --sizeOnly (-sz)                         Dump only the size of the vector

----------------

This means you want to do:

./bin/mahout vectordump -s path_to_docTopics_output -o path_you_want_to_write_text_output_to

and then just look in path_you_want_to_write_text_output_to, and it should
have what you want.
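
Once the text dump exists, a small program can pick out the most probable topic per document. The sketch below rests on an assumption that is not confirmed anywhere in this thread: that each dumped line has the shape key, tab, then a vector written as {index:value,index:value,...} (the tab-delimited key is what the --printKey description suggests). The exact layout may differ between Mahout versions and flags, so check a line of your own output first. `DocTopicLineSketch` and the sample line are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Assumed line layout: "<key>\t{topicIndex:probability,...}". Verify against real output.
class DocTopicLineSketch {

    /** Parses "{0:0.1,2:0.9}" into topic index -> probability. */
    static Map<Integer, Double> parseVector(String text) {
        Map<Integer, Double> topics = new LinkedHashMap<>();
        String body = text.trim();
        if (body.startsWith("{") && body.endsWith("}")) {
            body = body.substring(1, body.length() - 1);
        }
        for (String cell : body.split(",")) {
            if (cell.trim().isEmpty()) {
                continue; // tolerate an empty vector "{}"
            }
            String[] kv = cell.split(":");
            topics.put(Integer.parseInt(kv[0].trim()), Double.parseDouble(kv[1].trim()));
        }
        return topics;
    }

    /** Returns the index with the largest probability, or -1 for an empty vector. */
    static int dominantTopic(Map<Integer, Double> topics) {
        int best = -1;
        double bestP = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Double> e : topics.entrySet()) {
            if (e.getValue() > bestP) {
                bestP = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String line = "doc42\t{0:0.05,1:0.8,2:0.15}"; // made-up sample line
        String[] parts = line.split("\t", 2);
        Map<Integer, Double> topics = parseVector(parts[1]);
        System.out.println(parts[0] + " -> topic " + dominantTopic(topics));
    }
}
```

Reading the real file would only need a loop over its lines feeding the same two methods.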

  -jake

On Fri, Jul 8, 2011 at 6:16 AM, huaiyang gongzi <hu...@gmail.com> wrote:

> Thanks, Jake. But after typing  mahout  vectordump --help,  I got sth like
> this
>
> 11/07/08 09:14:25 ERROR vectors.VectorDumper: Exception
> org.apache.commons.cli2.OptionException: Unexpected --help while processing Options
>        at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
>        at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:100)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> Usage:
>
>  [--seqFile <seqFile> --output <output> --dictionaryType <dictionaryType>
>  --dictionary <dictionary> --csv --useKey --printKey --sizeOnly]
> Options
>
>  --seqFile (-s) seqFile                   The Sequence File containing the
>                                           Vectors
>  --output (-o) output                     The output file.  If not specified,
>                                           dumps to the console
>  --dictionaryType (-dt) dictionaryType    The dictionary file type
>                                           (text|sequencefile)
>  --dictionary (-d) dictionary             The dictionary file.
>  --csv (-c)                               Output the Vector as CSV.  Otherwise
>                                           it substitutes in the terms for
>                                           vector cell entries
>  --useKey (-u)                            If the Key is a vector, then dump
>                                           that instead
>  --printKey (-p)                          Print out the key as well, delimited
>                                           by a tab (or the value if useKey is
>                                           true)
>  --sizeOnly (-sz)                         Dump only the size of the vector
> 11/07/08 09:14:25 INFO driver.MahoutDriver: Program took 30 ms
>
>
> On Thu, Jul 7, 2011 at 5:56 PM, Jake Mannix <ja...@gmail.com> wrote:
>
> > On Thu, Jul 7, 2011 at 5:53 PM, wine lover <wi...@gmail.com> wrote:
> >
> > > Dear All,
> > >
> > > After running LDA analysis, I got the docTopic file, which is a regular
> > > sequence-file. How to transfer it into a readable format? I searched
> > > vectordumper, or vectordump, but did not get any useful results, such as
> > > how to use it in command-line? Thanks.
> > >
> >
> > So you say you "searched vectordumper/vectordump", you mean you
> > looked through the code looking for it, or you used it and it didn't do
> > what you wanted?
> >
> > If you're just not sure how to use it, try running "./bin/mahout" from your
> > distribution directory, with no arguments, and it will print out a bunch of
> > possible commands, one of which is vectordump.   If you try to run it
> > with no arguments, it will sadly exit silently, not telling you what the
> > usage is (this is a bug!), but if you try to give it an illegal argument,
> > like
> >
> > ./bin/mahout vectordump --help
> >
> > You'll see:
> > Usage:
> >
> >  [--seqFile <seqFile> --output <output> --dictionaryType <dictionaryType>
> >  --dictionary <dictionary> --csv --useKey --printKey --sizeOnly]
> >
> > Options
> >
> >  --seqFile (-s) seqFile                   The Sequence File containing the
> >                                           Vectors
> >  --output (-o) output                     The output file.  If not specified,
> >                                           dumps to the console
> >  --dictionaryType (-dt) dictionaryType    The dictionary file type
> >                                           (text|sequencefile)
> >  --dictionary (-d) dictionary             The dictionary file.
> >  --csv (-c)                               Output the Vector as CSV.  Otherwise
> >                                           it substitutes in the terms for
> >                                           vector cell entries
> >  --useKey (-u)                            If the Key is a vector, then dump
> >                                           that instead
> >  --printKey (-p)                          Print out the key as well, delimited
> >                                           by a tab (or the value if useKey is
> >                                           true)
> >  --sizeOnly (-sz)                         Dump only the size of the vector
> >
> > -----
> >
> > If you use these instructions to point to the docTopics output location,
> > you can have it print out the p(topic | document) for each topic/document
> > pair in your collection.
> >
> >  -jake
> >
>

Re: how to transfer the sequence file into readable format

Posted by huaiyang gongzi <hu...@gmail.com>.
Thanks, Jake. But after typing "mahout vectordump --help", I got something
like this:

11/07/08 09:14:25 ERROR vectors.VectorDumper: Exception
org.apache.commons.cli2.OptionException: Unexpected --help while processing Options
        at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
        at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:100)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Usage:
 [--seqFile <seqFile> --output <output> --dictionaryType <dictionaryType>
 --dictionary <dictionary> --csv --useKey --printKey --sizeOnly]
Options
  --seqFile (-s) seqFile                   The Sequence File containing the
                                           Vectors
  --output (-o) output                     The output file.  If not specified,
                                           dumps to the console
  --dictionaryType (-dt) dictionaryType    The dictionary file type
                                           (text|sequencefile)
  --dictionary (-d) dictionary             The dictionary file.
  --csv (-c)                               Output the Vector as CSV.  Otherwise
                                           it substitutes in the terms for
                                           vector cell entries
  --useKey (-u)                            If the Key is a vector, then dump
                                           that instead
  --printKey (-p)                          Print out the key as well, delimited
                                           by a tab (or the value if useKey is
                                           true)
  --sizeOnly (-sz)                         Dump only the size of the vector
11/07/08 09:14:25 INFO driver.MahoutDriver: Program took 30 ms


On Thu, Jul 7, 2011 at 5:56 PM, Jake Mannix <ja...@gmail.com> wrote:

> If you're just not sure how to use it, try running "./bin/mahout" from your
> distribution directory, with no arguments, and it will print out a bunch of
> possible commands, one of which is vectordump.   If you try to run it
> with no arguments, it will sadly exit silently, not telling you what the
> usage is (this is a bug!), but if you try to give it an illegal argument,
> like
>
> ./bin/mahout vectordump --help

Re: how to transfer the sequence file into readable format

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Jul 7, 2011 at 5:53 PM, wine lover <wi...@gmail.com> wrote:

> Dear All,
>
> After running LDA analysis, I got the docTopic file, which is a regular
> sequence-file. How to transfer it into a readable format? I searched
> vectordumper, or vectordump, but did not get any useful results, such as
> how
> to use it in command-line? Thanks.
>

When you say you "searched vectordumper/vectordump", do you mean you
looked through the code for it, or that you used it and it didn't do what
you wanted?

If you're just not sure how to use it, try running "./bin/mahout" from your
distribution directory with no arguments; it will print out a list of
possible commands, one of which is vectordump.  If you run vectordump
with no arguments, it will sadly exit silently, not telling you what the
usage is (this is a bug!), but if you give it an illegal argument,
like

./bin/mahout vectordump --help

You'll see:
Usage:
 [--seqFile <seqFile> --output <output> --dictionaryType <dictionaryType>
 --dictionary <dictionary> --csv --useKey --printKey --sizeOnly]
Options
  --seqFile (-s) seqFile                   The Sequence File containing the
                                           Vectors
  --output (-o) output                     The output file.  If not specified,
                                           dumps to the console
  --dictionaryType (-dt) dictionaryType    The dictionary file type
                                           (text|sequencefile)
  --dictionary (-d) dictionary             The dictionary file.
  --csv (-c)                               Output the Vector as CSV.  Otherwise
                                           it substitutes in the terms for
                                           vector cell entries
  --useKey (-u)                            If the Key is a vector, then dump
                                           that instead
  --printKey (-p)                          Print out the key as well, delimited
                                           by a tab (or the value if useKey is
                                           true)
  --sizeOnly (-sz)                         Dump only the size of the vector


-----

If you use these instructions to point to the docTopics output location,
you can have it print out the p(topic | document) for each topic/document
pair in your collection.

  -jake
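
For anyone finding this thread later, here is a rough sketch of the
invocation described above, using only flags that appear in the usage
text. It is an untested illustration, not a definitive recipe: the paths
(lda-output/docTopics/part-00000, doc-topics.txt) and the variable names
MAHOUT, DOC_TOPICS and OUT_FILE are assumptions, and whether --seqFile
expects a single part file or a directory may depend on your Mahout version.

```shell
#!/bin/sh
# Hypothetical locations; adjust to your own installation and LDA output.
MAHOUT="${MAHOUT:-./bin/mahout}"
DOC_TOPICS="${DOC_TOPICS:-lda-output/docTopics/part-00000}"
OUT_FILE="${OUT_FILE:-doc-topics.txt}"

# Build the argument list from flags shown in the vectordump usage text:
#   --seqFile   the sequence file holding the document/topic vectors
#   --output    write to a file instead of the console
#   --printKey  also print each key, delimited by a tab
set -- vectordump --seqFile "$DOC_TOPICS" --output "$OUT_FILE" --printKey

if [ -x "$MAHOUT" ]; then
  # Mahout launcher found: run the dump.
  "$MAHOUT" "$@"
else
  # No launcher at that path: just show the command that would be run.
  echo "would run: $MAHOUT $*"
fi
```

Leaving out --output should dump to the console instead, per the usage
text, which is handy for a quick look at the first few documents.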