Posted to user@mahout.apache.org by Katherine Huang <kh...@shopzilla.com> on 2012/01/26 03:52:39 UTC

seq2sparse generated dictionary is missing words

I am doing a trial run starting with a sequence file that contains the following (this is from seqdumper; I just made each key the same as its value):

Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text
Key: first book nature specialword boxes: Value: first book nature specialword boxes
Key: fourth fake example with fake: Value: fourth fake example with fake
Key: second book fun: Value: second book fun
Key: third unique document item: Value: third unique document item
Key: fifth bag of words: Value: fifth bag of words
Count: 5


When I run
mahout seq2sparse -i /user/trial_01252012/processed_doc_trial/ -o /khuang/trial_01252012/keyword_Vectors_461_named -ow -md 1 -a org.apache.lucene.analysis.WhitespaceAnalyzer -wt tf -seq -nv

And then I dump the tokenized vectors:
mahout seqdumper -s /user/trial_01252012/vec_named/tf-vectors/part-r-00000

I only see three of my original documents:

Input Path: /user/khuang/trial_01252012/vec_named/tf-vectors/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
Key: first book nature specialword boxes: Value: org.apache.mahout.math.VectorWritable@e5d391d
Key: fourth fake example with fake: Value: org.apache.mahout.math.VectorWritable@e5d391d
Key: second book fun: Value: org.apache.mahout.math.VectorWritable@e5d391d
Count: 3


In addition, the dictionary is missing words. Is there a reason for this?
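
One possible explanation, which nobody in this thread confirms, is a minimum-support cutoff: the command sets -md 1 but no support threshold, and seq2sparse is assumed here to drop terms that occur fewer than 2 times across the corpus (worth checking against `mahout seq2sparse --help`). The sketch below is plain Python, not Mahout code; MIN_SUPPORT, the helper names, and the rule that documents left without terms are dropped are all assumptions made for illustration. It only shows what such a cutoff would do to the five documents listed above.

```python
from collections import Counter

# The five documents from the seqdumper output above (key == value).
docs = [
    "first book nature specialword boxes",
    "fourth fake example with fake",
    "second book fun",
    "third unique document item",
    "fifth bag of words",
]

# ASSUMPTION: a term must occur at least this many times in the whole
# corpus to enter the dictionary. Not confirmed by the thread.
MIN_SUPPORT = 2


def tokenize(text):
    # Whitespace splitting, standing in for the WhitespaceAnalyzer.
    return text.split()


def build_dictionary(documents, min_support):
    # Count every token occurrence across all documents, then keep
    # only the terms that reach the support threshold.
    totals = Counter(tok for doc in documents for tok in tokenize(doc))
    kept = sorted(term for term, n in totals.items() if n >= min_support)
    return {term: index for index, term in enumerate(kept)}


def tf_vectors(documents, dictionary):
    # Term-frequency vectors keyed by document text (like -nv names).
    vectors = {}
    for doc in documents:
        counts = Counter(t for t in tokenize(doc) if t in dictionary)
        # ASSUMPTION: a document with no surviving terms is omitted.
        if counts:
            vectors[doc] = {dictionary[t]: n for t, n in counts.items()}
    return vectors


dictionary = build_dictionary(docs, MIN_SUPPORT)
vectors = tf_vectors(docs, dictionary)

print("dictionary:", dictionary)
for name, vec in vectors.items():
    print(name, "->", vec)
```

Under these assumptions only "book" and "fake" reach the threshold, so the third and fifth documents end up with no terms at all, which would be consistent with the three keys seen in the dump. If that is the cause, lowering the support threshold on the real command should bring the missing words and documents back.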




Re: seq2sparse generated dictionary is missing words

Posted by Katherine Huang <kh...@shopzilla.com>.
Vector dump doesn't seem to dump a key:text, value:vectorwritable


$ mahout dumpTxtVec -s /user/khuang/trial_01252012/vec_named/tf-vectors/part-r-00000

Input Path: /user/trial_01252012/vec_named/tf-vectors/part-r-00000
Key class: first book nature specialword boxes  Value class: org.apache.mahout.math.NamedVector@2
Key class: fourth fake example with fake  Value class: org.apache.mahout.math.NamedVector@40000002
Key class: second book fun  Value class: org.apache.mahout.math.NamedVector@2
12/01/25 19:18:33 INFO driver.MahoutDriver: Program took 351 ms






Re: seq2sparse generated dictionary is missing words

Posted by Suneel Marthi <su...@yahoo.com>.



>And then I dump the tokenized vectors:
>mahout seqdumper -s /user/trial_01252012/vec_named/tf-vectors/part-r-00000

Did you mean to call vectordump to dump your vectors?
