Posted to user@mahout.apache.org by Gregory Lawrence <gr...@yahoo-inc.com> on 2009/11/13 02:57:10 UTC

Sequence file format for Kmeans, LDA, etc.

Hi,

I'm trying to write a map-reduce program that will convert text documents into a format suitable for Mahout's clustering algorithms. From what I can gather, it seems like the output should be a sequence file with a long integer document index (key) and a sparse vector (value) that contains TF (or TFIDF) counts. This sparse vector also has a name that identifies the document.

Does the long integer document index matter? I would rather avoid having to set this to something meaningful. Do the numbers have to be unique or contiguous? Does the name of the sparse vector matter? I noticed that it is being set as a string in LuceneIterable.
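To make the record shape in the question concrete, here is a minimal sketch of one (key, value) pair: a long document index, a document name, and a sparse map of term id to TF count. It is not Mahout or Hadoop code; TfRecordSketch, fromText and the in-memory dictionary are hypothetical names invented for illustration. A real map-reduce job could not share an in-memory dictionary across mappers this way.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for one (key, value) record; not a Mahout class.
final class TfRecordSketch {
    final long docIndex;                    // key: an arbitrary long
    final String name;                      // identifies the source document
    final Map<Integer, Double> termCounts;  // sparse vector: term id -> TF count

    TfRecordSketch(long docIndex, String name, Map<Integer, Double> termCounts) {
        this.docIndex = docIndex;
        this.name = name;
        this.termCounts = termCounts;
    }

    // Tokenizes on non-word characters, assigns term ids in first-seen order
    // (assumption: a shared in-memory dictionary), and counts occurrences.
    static TfRecordSketch fromText(long docIndex, String name, String text,
                                   Map<String, Integer> dictionary) {
        Map<Integer, Double> counts = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token for leading punctuation
            }
            Integer id = dictionary.get(token);
            if (id == null) {
                id = dictionary.size();
                dictionary.put(token, id);
            }
            counts.merge(id, 1.0, Double::sum);
        }
        return new TfRecordSketch(docIndex, name, counts);
    }
}
```

Only the terms that actually occur in a document get an entry, which is what makes the vector sparse.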

Re: Sequence file format for Kmeans, LDA, etc.

Posted by Ted Dunning <te...@gmail.com>.

I am working on them.  They are new to the process.

On Fri, Nov 13, 2009 at 4:59 PM, Jake Mannix <ja...@gmail.com> wrote:

> You should get your so-far-silent friends to join in the conversation. :)




-- 
Ted Dunning, CTO
DeepDyve

Re: Sequence file format for Kmeans, LDA, etc.

Posted by Jake Mannix <ja...@gmail.com>.

On Fri, Nov 13, 2009 at 4:15 PM, Ted Dunning <te...@gmail.com> wrote:

> Jake,
>
> Can you post a preview on the appropriate JIRA?  I have been in contact with
> a group of so-far-silent proto-contributors who are very interested in these
> algorithms.
>

Yeah, now that the consensus appears to be that we will be keeping our current
linear interfaces intact (with maybe some modifications), and the question about
linear primitives revolves around *implementation* only, I can finish the patches
I had started before - since I know what target interfaces I'm aiming towards.

I'll try to attach a patch in the next week or so, as well as a link to the
appropriate clone on github for people who prefer working with it that way.

You should get your so-far-silent friends to join in the conversation. :)

  -jake

Re: Sequence file format for Kmeans, LDA, etc.

Posted by Ted Dunning <te...@gmail.com>.

Jake,

Can you post a preview on the appropriate JIRA?  I have been in contact with
a group of so-far-silent proto-contributors who are very interested in these
algorithms.

On Fri, Nov 13, 2009 at 1:26 PM, Jake Mannix <ja...@gmail.com> wrote:

> Decomposer (in the process of donating, just gotta choose what linear
> primitives to convert to!) has a DistributedMatrix which does this for the
> already-parsed-into SequenceFiles of Writable Vectors, and I really
> like this kind of interface.
>

-- 
Ted Dunning, CTO
DeepDyve

Re: Sequence file format for Kmeans, LDA, etc.

Posted by Jake Mannix <ja...@gmail.com>.

Decomposer (in the process of donating, just gotta choose what linear
primitives to convert to!) has a DistributedMatrix which does this for the
already-parsed-into SequenceFiles of Writable Vectors, and I really
like this kind of interface.

Doing things like DistributedMatrix HdfsInputTextMatrix.extractTfIdfCorpus()
where this method sets up and runs a M/R job on a remote cluster, with the
output also living on HDFS, and the handle you have can now do all the
things which a Matrix impl can do... this kind of thing makes using the code
much less like scripting some procedural Jobs, and more like actual OO
programming.

  -jake

On Fri, Nov 13, 2009 at 1:15 PM, Ted Dunning <te...@gmail.com> wrote:

> This talk combined with previous talk about preferred mode of composing
> tools (script writing using java) is beginning to make me think that we need
> something like a HdfsMatrix and LocalFileMatrix which are simply wrappers
> around file names, but which allow extraction of elements (for debugging and
> diagnostics and sequential implementations) or for passing to generic driver
> routines or receiving from generic conversion routines.
>
> Should I open a JIRA?
>
> On Fri, Nov 13, 2009 at 11:54 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>
> > Also, take a look at what the TfIdfDriver does for the classifier stuff.
> > This is a M/R job for converting text into its format.  I think we can
> > abstract that to be more general purpose and then move it under the Utils
> > module.  The only thing that likely needs to change is whether we output the
> > Writable for the classifier or whether we output a Vector.  That is my naive
> > view at this point.
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Sequence file format for Kmeans, LDA, etc.

Posted by Ted Dunning <te...@gmail.com>.

This talk combined with previous talk about preferred mode of composing
tools (script writing using java) is beginning to make me think that we need
something like a HdfsMatrix and LocalFileMatrix which are simply wrappers
around file names, but which allow extraction of elements (for debugging and
diagnostics and sequential implementations) or for passing to generic driver
routines or receiving from generic conversion routines.

Should I open a JIRA?
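
A rough sketch of the local half of that idea, assuming a plain-text file with one matrix row per line and whitespace-separated values. The class name LocalFileMatrixSketch, its methods, and the file format are all hypothetical; nothing like this exists in Mahout as far as this thread shows.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical sketch: a matrix handle that is only a wrapper around a file name.
final class LocalFileMatrixSketch {
    private final Path path;

    LocalFileMatrixSketch(String fileName) {
        this.path = Paths.get(fileName);
    }

    // The wrapped file name, e.g. for passing to a generic driver routine.
    String fileName() {
        return path.toString();
    }

    // Extracts one element by re-reading the file on every call.
    // Slow by design; intended for debugging and diagnostics only.
    // Assumed format: one row per line, values separated by whitespace.
    double get(int row, int column) {
        try {
            List<String> lines = Files.readAllLines(path);
            String[] cells = lines.get(row).trim().split("\\s+");
            return Double.parseDouble(cells[column]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The handle holds no matrix data itself, so it stays cheap to create and pass around; an HDFS variant would presumably differ only in how the file is opened.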

On Fri, Nov 13, 2009 at 11:54 AM, Grant Ingersoll <gs...@apache.org> wrote:

> Also, take a look at what the TfIdfDriver does for the classifier stuff.
> This is a M/R job for converting text into its format.  I think we can
> abstract that to be more general purpose and then move it under the Utils
> module.  The only thing that likely needs to change is whether we output the
> Writable for the classifier or whether we output a Vector.  That is my naive
> view at this point.
>

-- 
Ted Dunning, CTO
DeepDyve

Re: Sequence file format for Kmeans, LDA, etc.

Posted by Grant Ingersoll <gs...@apache.org>.

On Nov 12, 2009, at 8:57 PM, Gregory Lawrence wrote:

> Hi,
> 
> I'm trying to write a map-reduce program that will convert text documents into a format suitable for Mahout's clustering algorithms. From what I can gather, it seems like the output should be a sequence file with a long integer document index (key) and a sparse vector (value) that contains TF (or TFIDF) counts. This sparse vector also has a name that identifies the document.
> 
> Does the long integer document index matter?

No

> I would rather avoid having to set this to something meaningful. Do the numbers have to be unique or contiguous?

This is ignored in the clustering.

> Does the name of the sparse vector matter?

Yes, as it is part of the equals() method.

> I noticed that it is being set as a string in LuceneIterable.

Right.  You should be able to model after LuceneIterable and the Driver program there.

Also, take a look at what the TfIdfDriver does for the classifier stuff.  This is a M/R job for converting text into its format.  I think we can abstract that to be more general purpose and then move it under the Utils module.  The only thing that likely needs to change is whether we output the Writable for the classifier or whether we output a Vector.  That is my naive view at this point.

-Grant
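
The point above about the name being part of equals() can be illustrated with a small stand-alone sketch. NamedVectorSketch is a hypothetical class written for this example, not Mahout's vector implementation; it only assumes that equality compares both the name and the values.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical class; assumes equality covers both the name and the values.
final class NamedVectorSketch {
    private final String name;
    private final double[] values;

    NamedVectorSketch(String name, double... values) {
        this.name = name;
        this.values = values.clone(); // defensive copy
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof NamedVectorSketch)) {
            return false;
        }
        NamedVectorSketch that = (NamedVectorSketch) other;
        // Two vectors with identical values but different names are NOT equal.
        return Objects.equals(name, that.name) && Arrays.equals(values, that.values);
    }

    @Override
    public int hashCode() {
        // Kept consistent with equals(): combines name and values.
        return 31 * Objects.hashCode(name) + Arrays.hashCode(values);
    }
}
```

Under that assumption, giving every vector the same name (or none) would make documents with identical term counts indistinguishable in sets and maps, so a per-document name is the safer choice.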

Re: Sequence file format for Kmeans, LDA, etc.

Posted by Ted Dunning <te...@gmail.com>.

I don't think that the document index matters, but it is probably good for
it to take many different values, even if they are not entirely distinct.

The name of the sparse vector is intended to provide you with traceability
back to your original data.

On Thu, Nov 12, 2009 at 5:57 PM, Gregory Lawrence <gr...@yahoo-inc.com> wrote:

>
> Does the long integer document index matter? I would rather avoid having to
> set this to something meaningful. Do the numbers have to be unique or
> contiguous? Does the name of the sparse vector matter? I noticed that it is
> being set as a string in LuceneIterable.
>

-- 
Ted Dunning, CTO
DeepDyve