Posted to user@mahout.apache.org by "Scott C. Cote" <sc...@gmail.com> on 2014/01/22 00:30:02 UTC

Problem converting tokenized documents into TFIDF vectors

All,

Not a Mahout .9 problem - once I have this working with .8 Mahout, will
immediately pull in the .9 stuff ...

I am trying to make a small data set work (perhaps it is too small?) where I
am clustering skills (phrases).  For sake of brevity (my steps are long), I
have not documented the steps that I took to get my text of skills into
tokenized form ...

By the time I get to the TFIDF vectors (step 4), my output is empty:
no TFIDF vectors are generated.


I have broken this down into 4 steps.
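(For context: the directories dumped below (tokenized-documents, dictionary.file-0,
tf-vectors, frequency.file-0, partial-vectors-0, tfidf-vectors) match the layout that
the seq2sparse driver writes under its output directory. A rough sketch of that
pipeline, with placeholder paths rather than my actual commands, would be:

mahout seqdirectory -i skills-text -o skills-seq -c UTF-8
mahout seq2sparse -i skills-seq -o skills-vectors -wt tfidf -ow

where -wt picks TFIDF weighting and -ow overwrites any previous output.)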



Step 1. Tokenize docs.  Here is output validating success of tokenization.

mahout seqdumper -i tokenized-documents/part-m-00000

yields

Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.common.StringTuple
Key: 1: Value: [rest, web, services]
Key: 2: Value: [soa, design, build, service, oriented, architecture, using,
java]
Key: 3: Value: [oracle, jdbc, build, java, database, connectivity, layer,
oracle]
Key: 4: Value: [spring, injection, use, spring, templates, inversion,
control]
Key: 5: Value: [j2ee, create, device, enterprise, java, beans, integrate,
spring]
Key: 6: Value: [can, deploy, web, archive, war, files, tomcat]
Key: 7: Value: [java, graphics, uses, android, graphics, packages, create,
user, interfaces]
Key: 8: Value: [core, java, understand, core, libraries, java, development,
kit]
Key: 9: Value: [design, develop, jdbc, sql, queries]
Key: 10: Value: [multithreading, thread, synchronization]
Count: 10


Step 2. Create term frequency vectors from the tokenized sequence file (step
1).

mahout seqdumper -i dictionary.file-0

Yields

Key: java: Value: 0
Count: 1

mahout seqdumper -i tf-vectors/part-r-00000

Yields

Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.math.VectorWritable
Key: 2: Value: 2:{0:1.0}
Key: 3: Value: 3:{0:1.0}
Key: 5: Value: 5:{0:1.0}
Key: 7: Value: 7:{0:1.0}
Key: 8: Value: 8:{0:2.0}
Count: 5


Step 3. Create the document frequency data.

mahout seqdumper -i frequency.file-0

Yields

Key: 0: Value: 5
Count: 1

NOTE to READER:  Java is NOT the only common word - web occurs more than
once - how come it's not included?





Step 4. Create the tfidf vectors: (I can't remember whether the partial
vectors were created in the previous step)

mahout seqdumper -i partial-vectors-0/part-r-00000

yields

INFO: Command line arguments: {--endPhase=[2147483647],
--input=[part-r-00000], --startPhase=[0], --tempDir=[temp]}
2014-01-21 16:57:23.661 java[24565:1203] Unable to load realm info from
SCDynamicStore
Input Path: part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.math.VectorWritable
Key: 2: Value: 2:{}
Key: 3: Value: 3:{}
Key: 5: Value: 5:{}
Key: 7: Value: 7:{}
Key: 8: Value: 8:{}
Count: 5

NOTE to READER:  What do the empty brackets mean here?


mahout seqdumper -i tfidf-vectors/part-r-00000

Yields

Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.math.VectorWritable
Count: 0

Why 0?

What am I NOT understanding here?

SCott



Re: Problem converting tokenized documents into TFIDF vectors

Posted by "Scott C. Cote" <sc...@gmail.com>.
I understand that it is not official.

Am just trying to provide another test opportunity for the .9 release.

SCott



Re: Problem converting tokenized documents into TFIDF vectors

Posted by Suneel Marthi <su...@yahoo.com>.
Scott,

FYI... 0.9 Release is not official yet. The project trunk's still at 0.9-SNAPSHOT.

Please feel free to update the documentation.
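
(If you want to exercise the snapshot in the meantime, building trunk locally is the
usual route; the checkout URL and Maven flags below are the customary ones, so
double-check them against the wiki:

svn co http://svn.apache.org/repos/asf/mahout/trunk mahout-trunk
cd mahout-trunk
mvn clean install -DskipTests

The bin/mahout script in the built tree can then be pointed at the same inputs.)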






Re: Problem converting tokenized documents into TFIDF vectors

Posted by "Scott C. Cote" <sc...@gmail.com>.
Drew,

I'm sorry - I'm derelict (as opposed to dirichlet) in responding that I
got past my problem.

It was the min freq that was killing me.  Forgot about that parameter.
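
(For anyone who trips over the same thing: with a corpus this tiny, the term-pruning
thresholds have to be lowered explicitly. The relevant seq2sparse flags are
-s/--minSupport and -md/--minDF; a sketch of the kind of invocation that keeps every
term (placeholder paths, not my exact command) would be:

mahout seq2sparse -i skills-seq -o skills-vectors -wt tfidf -s 1 -md 1 -ow
mahout seqdumper -i skills-vectors/dictionary.file-0

after which the dictionary dump should list every distinct token.)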

Thank you for your assist.

Hope to be able to return the favor.

Am on the hook to update documentation for Mahout already - maybe that
will do it :)

This week, I'll be testing my code against the .9 distribution.

SCott



Re: Problem converting tokenized documents into TFIDF vectors

Posted by Drew Farris <dr...@apache.org>.
Scott,

Based on the dictionary output, it looks like the process of generating
vectors from your tokenized text is not working properly. The only term
that's making it into your dictionary is 'java' - everything else is being
filtered out. Furthermore, your tf vectors have a single dimension, '0',
whose weight corresponds to the frequency of the term 'java' in each
document.

I would check the settings for minimum document frequency in the
vectorization process. What is the command you are using to create vectors
from your tokenized documents?
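
(A quick way to see what survived the pruning, assuming the standard seq2sparse
output layout, is to dump the tf vectors against the dictionary so the indices resolve
back to terms; the vectordump flags here are from memory, so adjust as needed:

mahout vectordump -i tf-vectors/part-r-00000 -d dictionary.file-0 -dt sequencefile

If only 'java' shows up, the minimum-frequency settings are almost certainly the
culprit.)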

Drew

