Posted to user@mahout.apache.org by Paul Ingles <pa...@oobaloo.co.uk> on 2009/07/07 23:37:25 UTC

Document Clustering

Hi,

I've been reading through the archives in search of inspiration for a problem I've been trying to solve at work, and was hoping I could share where my head's at and get some pointers on where to go. Since we're looking at clustering between 17m and 70m documents, we plan to implement this in Hadoop.

We're trying to build clusters of (relatively small) documents, based  
on the words/terms they contain. Clusters will probably range in size  
between 10 and 1000 documents. Clusters should ultimately be populated  
by documents that contain almost identical terms (nothing clever like  
stemming done so far).

So far, I've been working down the pairwise similarity route. So,  
using a few MapReduce jobs we produce something along the lines of the  
following:

Da: [Db,0.1] [Dc,0.4]
Db: [Dc,0.5] [Df,0.9]
...
Dj:

Each row corresponds to a document and contains a vector of (related document, similarity score) tuples. Viewed another way, it's the typical matrix:

     |   A  |   B  |   C
-----+------+------+------
  A  |      |  0.1 |  0.4
  B  |      |      |  0.5

etc. A higher number means a more closely related document.
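
For concreteness, here's a stripped-down sketch of the sort of final-stage reducer that could assemble those rows. The class name and the "Db,0.1" value format are illustrative, not our actual code:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Gathers the "Db,0.1"-style similarity tuples emitted for one document
// and concatenates them into a single row, as in the example above.
public class SimilarityRowReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text docId, Iterable<Text> tuples, Context context)
      throws IOException, InterruptedException {
    StringBuilder row = new StringBuilder();
    for (Text tuple : tuples) {
      row.append('[').append(tuple.toString()).append("] ");
    }
    context.write(docId, new Text(row.toString().trim()));
  }
}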

I've been trying to get my head around how to cluster these in a set of MapReduce jobs and I'm not quite sure how to proceed: the examples I've read around kmeans, canopy clustering etc. all seem to work on multidimensional (numerical) data. Given the data above, is it even possible to adapt those algorithms? The potential centroids in the example above would be the documents themselves, and I just can't get my head around applying the algorithms to this kind of data.

I guess the alternative would be to step back to producing a documents x terms matrix:

     |  Ta  |  Tb  |  Tc
-----+------+------+------
 Da  |      |  0.1 |  0.4
 Db  |      |      |  0.5

And then cluster based on this? This seems similar in structure to the  
user x movie recommendation matrix that's often used?
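
To make that concrete, here's a toy sketch of how one row of that matrix could be built. It just uses raw term counts as weights; TF-IDF or another weighting would slot in the same way (the names are made up):

import java.util.HashMap;
import java.util.Map;

// Builds a sparse term -> weight vector for one tokenised document,
// using the raw term frequency as the weight.
public class TermVectorBuilder {
  public static Map<String, Double> termVector(String[] tokens) {
    Map<String, Double> vector = new HashMap<String, Double>();
    for (String term : tokens) {
      Double count = vector.get(term);
      vector.put(term, count == null ? 1.0 : count + 1.0);
    }
    return vector;
  }
}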

I'd really appreciate people's thoughts: am I thinking about this the right way? Is there a better way?

Thanks,
Paul

Re: Document Clustering

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 7, 2009, at 6:10 PM, Ted Dunning wrote:

> Paul,
>
> You are doing a fine job so far.  Typically, I would recommend you  
> stick
> with the Document x Term matrix formulation.  That form would allow  
> you to
> use kmeans almost directly.
>
> The good news for you is that Grant is working on an extended  
> example to do
> just what you are looking to do.  I expect he will comment shortly.

On vacation...  I will try to find some time soon.

>
> On Tue, Jul 7, 2009 at 2:37 PM, Paul Ingles <pa...@oobaloo.co.uk> wrote:
>
>> [...]

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Document Clustering

Posted by Ted Dunning <te...@gmail.com>.
Paul,

You are doing a fine job so far.  Typically, I would recommend you stick
with the Document x Term matrix formulation.  That form would allow you to
use kmeans almost directly.
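
To make "almost directly" concrete, here is a toy version of the k-means assignment step over sparse term vectors, using cosine similarity. This is plain illustrative Java, not Mahout's API:

import java.util.List;
import java.util.Map;

// Toy k-means assignment step over sparse term -> weight vectors.
public class KMeansAssign {

  // Cosine similarity between two sparse vectors.
  static double cosine(Map<String, Double> a, Map<String, Double> b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (Map.Entry<String, Double> e : a.entrySet()) {
      Double w = b.get(e.getKey());
      if (w != null) {
        dot += e.getValue() * w;
      }
      normA += e.getValue() * e.getValue();
    }
    for (double w : b.values()) {
      normB += w * w;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB) + 1e-12);
  }

  // Assigns a document vector to the most similar centroid.
  static int nearestCentroid(Map<String, Double> doc,
                             List<Map<String, Double>> centroids) {
    int best = 0;
    double bestSim = -1.0;
    for (int i = 0; i < centroids.size(); i++) {
      double sim = cosine(doc, centroids.get(i));
      if (sim > bestSim) {
        bestSim = sim;
        best = i;
      }
    }
    return best;
  }
}

The only real adaptation is the similarity measure and recomputing each centroid as the (normalized) sum of its members' sparse vectors; the rest of k-means carries over unchanged.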

The good news for you is that Grant is working on an extended example to do
just what you are looking to do.  I expect he will comment shortly.

On Tue, Jul 7, 2009 at 2:37 PM, Paul Ingles <pa...@oobaloo.co.uk> wrote:

> [...]



-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)