
Posted to user@mahout.apache.org by Ole-Martin Mørk <ol...@gmail.com> on 2009/09/29 14:57:19 UTC

Using mahout to cluster terms in Lucene

Hi.
I have been using org.apache.mahout.utils.vectors.lucene.Driver
and org.apache.mahout.clustering.kmeans.KMeansDriver to cluster documents in
our Lucene index and it works great! I am wondering though, is it possible
to use Mahout to cluster terms?

I want to cluster terms that often appear in the same documents.

Thank you.

--
Ole-Martin Mørk
http://twitter.com/olemartin
http://flickr.com/olemartin

Re: Using mahout to cluster terms in Lucene

Posted by Ted Dunning <te...@gmail.com>.

Yes.  Transposing is exactly what I was suggesting but in the context of,
say, k-means.

LDA has the equivalent of U and V matrices lying around that should allow
clustering of terms and documents in the same space.  That is an interesting
thing to be able to do in any case.  The words give you a description of the
content and the documents give you examples.

On Tue, Sep 29, 2009 at 2:14 PM, Jake Mannix <ja...@gmail.com> wrote:

> Clustering documents by term (a la LDA or SVD) also leads to a nice
> clustering of terms by just looking at "the transpose", right?  This is
> literally the case for SVD: if M = U S V' is your SVD, where M is
> represented as a row matrix and U and V are column matrices (document by
> reduced-dimension and term by reduced dimension, respectively), then
> typically you just keep V and S around.  In this case the transpose of V
> has, as row vectors, the projection of each term onto the reduced
> dimensional space, and doing clustering on that set of reduced vectors
> performs "concept-aware" term clustering (and if you just want the system
> to run as a search engine [find me the top terms "close" to a given term],
> you just sort by descending dot-product on the rows of V).
>
> For our LDA implementation, I'm not sure, but given the set of all topics,
> each topic has a probability of producing each term, and so the
> transpose of this matrix has the probability of any given term being
> produced by each of the topics.  I'm not sure if our current implementation
> has methods you can easily use to get access to this information and
> thereby cluster the terms, however.



-- 
Ted Dunning, CTO
DeepDyve

Re: Using mahout to cluster terms in Lucene

Posted by Jake Mannix <ja...@gmail.com>.

On Tue, Sep 29, 2009 at 2:14 PM, Jake Mannix <ja...@gmail.com> wrote:

> Clustering documents by term (a la LDA or SVD) also leads to a nice
> clustering of terms by just looking at "the transpose", right?

This should of course read "Clustering documents by *document* ... also
leads to a nice clustering of terms..."

  -jake

Re: Using mahout to cluster terms in Lucene

Posted by Jake Mannix <ja...@gmail.com>.

Clustering documents by term (a la LDA or SVD) also leads to a nice
clustering of terms by just looking at "the transpose", right?  This is
literally the case for SVD: if M = U S V' is your SVD, where M is
represented as a row matrix and U and V are column matrices (document by
reduced-dimension and term by reduced-dimension, respectively), then
typically you just keep V and S around.  In this case the transpose of V
has, as row vectors, the projection of each term onto the
reduced-dimensional space, and doing clustering on that set of reduced vectors
performs "concept-aware" term clustering (and if you just want the system to
run as a search engine [find me the top terms "close" to a given term], you
just sort by descending dot-product on the rows of V).
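
To spell out the linear algebra behind the paragraph above, here is a short
sketch.  The shapes n (documents), m (terms) and k (reduced rank) are assumed
notation, not something stated in the thread.

```latex
% M: document-by-term matrix, truncated SVD of rank k
M \approx U S V^{\top}, \qquad
U \in \mathbb{R}^{n \times k},\quad
S \in \mathbb{R}^{k \times k},\quad
V \in \mathbb{R}^{m \times k}

% Term-by-term similarity in the reduced space, using U^{\top} U = I
M^{\top} M \approx V S\, U^{\top} U\, S V^{\top} = (V S)(V S)^{\top}
```

Each term therefore corresponds to one k-dimensional vector (a row of V, or a
row of its transpose, depending on how V is stored).  Whether those vectors
are scaled by S before taking dot products is a design choice: with the
scaling, the dot products approximate the entries of M'M.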

For our LDA implementation, I'm not sure, but given the set of all topics,
each topic has a probability of producing each term, and so the
transpose of this matrix has the probability of any given term being
produced by each of the topics.  I'm not sure if our current implementation
has methods you can easily use to get access to this information and thereby
cluster the terms, however.
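
As a rough illustration of "looking at the transpose" for LDA, the sketch
below uses a made-up topic-by-term probability matrix.  It does not use any
Mahout API; the names topic_term, transpose and normalize are hypothetical,
and the uniform prior over topics is an assumption made only to keep the
example small.

```python
# Hypothetical topic-term matrix: rows are topics, columns are terms,
# entry [t][w] = p(term w | topic t). Values are made up for illustration.
terms = ["tennis", "wimbledon", "guitar", "album"]
topic_term = [
    [0.5, 0.4, 0.05, 0.05],   # a "sports" topic
    [0.1, 0.0, 0.4, 0.5],     # a "music" topic
]

def transpose(matrix):
    """Swap rows and columns: result[w][t] == matrix[t][w]."""
    return [list(col) for col in zip(*matrix)]

def normalize(vec):
    """Scale a non-negative vector so that its entries sum to 1."""
    total = sum(vec)
    return [x / total for x in vec] if total else vec

# One row per term. Normalizing each row gives p(topic | term),
# assuming (for this sketch only) a uniform prior over topics.
term_topic = {t: normalize(row) for t, row in zip(terms, transpose(topic_term))}
```

Each term now has a short vector over topics, and those vectors could be fed
to any clustering routine in place of document vectors.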

On Tue, Sep 29, 2009 at 1:05 PM, Grant Ingersoll <gs...@apache.org> wrote:

> The LDA implementation kind of clusters on terms to generate topics.  It
> sounds like you want some co-occurrence analysis, I'm not sure that the
> clustering algorithms are best for that, but perhaps others have insight.
>  I could imagine doing this with HBase or Pig and just keeping a matrix
> where each cell kept track of the number of times both terms appear in a
> document (or even within some window in a document).

Re: Using mahout to cluster terms in Lucene

Posted by Jake Mannix <ja...@gmail.com>.

Heh.  What Ted said, but longer-winded.

On Tue, Sep 29, 2009 at 2:13 PM, Ted Dunning <te...@gmail.com> wrote:

> Another way to do this through the back door is to transpose the document
> set so that you have a list of documents for each term.  Index this and
> cluster it just as if it were normal documents and you will have a form of
> term clustering.

Re: Using mahout to cluster terms in Lucene

Posted by Ted Dunning <te...@gmail.com>.

This is a great example.

On Wed, Sep 30, 2009 at 1:43 AM, Jake Mannix <ja...@gmail.com> wrote:

> It's not necessarily the case that if the nearest point to pointA in a
> collection of points is pointB, that the nearest point to pointB is pointA,
> right?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Using mahout to cluster terms in Lucene

Posted by Jake Mannix <ja...@gmail.com>.

It's not necessarily the case that if the nearest point to pointA in a
collection of points is pointB, that the nearest point to pointB is pointA,
right?  Even in one dimension, if your three points are {0, 1, 1.1}, the
nearest point to 0 is 1, but the nearest point to 1 is 1.1.

I'm not sure if this invalidates your desire to have some sort of
conceptual hierarchy in your clustering, but just because metrics
are symmetric doesn't mean that iterating nearest(nearest(...(A)...))
repeats quickly (it doesn't even need to converge).
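
The three-point example above can be checked with a few lines of Python.
This is only a sketch; the helper name nearest is made up here.

```python
def nearest(point, points):
    """Return the closest other point to `point` under absolute distance."""
    others = [p for p in points if p != point]
    return min(others, key=lambda p: abs(p - point))

points = [0.0, 1.0, 1.1]

# The distance is symmetric (|a - b| == |b - a|), yet "nearest" is not mutual:
# starting from 0.0 the chain never comes back to 0.0.
chain = [0.0]
for _ in range(3):
    chain.append(nearest(chain[-1], points))
```

Here chain comes out as 0.0, 1.0, 1.1, 1.0: the nearest point to 0 is 1, but
the nearest point to 1 is 1.1.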

  -jake

On Wed, Sep 30, 2009 at 12:42 AM, Shashikant Kore <sh...@gmail.com> wrote:

> Ted,
>
> Some time back I had thought about this idea. But, I sensed one
> potential problem with this approach. The resulting co-occurrence will
> be bi-directional. For documents this property is fine, but for terms,
> it may not be desirable in some cases.
>
> For example, if "Roger Federer" is the keyword, the co-occurring terms
> will be "Tennis", "Grand slam", "Wimbledon", etc. But, for "Tennis",
> the list of top co-occurring terms may not include "Roger Federer."
>
> Is there a way to identify the directional relationship among terms?
>
> Of course, this was just a thought and no real code was written to
> verify the assertion.
>
> --shashi

Re: Using mahout to cluster terms in Lucene

Posted by Ole-Martin Mørk <ol...@gmail.com>.

Thank you all. Great feedback.
--
Ole-Martin Mørk
http://twitter.com/olemartin
http://flickr.com/olemartin

Re: Using mahout to cluster terms in Lucene

Posted by Ted Dunning <te...@gmail.com>.

The cooccurrence counts themselves form a symmetric matrix:
(A'A)' = A'(A')' = A'A, because transposing a product reverses the order of
its factors.

The filtering for anomalous cooccurrence that sparsifies the cooccurrence
can introduce asymmetry as you point out.

The most prominent time that I saw this in practice was in music
recommendations where a fair number of artists linked to high profile bands
such as the Beatles, but the reverse link did not survive the filtering.
You can enforce bi-directionality, but I have usually found that the
asymmetry isn't a problem and often accords with intuitions about the field.
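
A small sketch of both points: the raw counts are symmetric, but keeping only
the strongest links per term is not.  The simple top-k filter below is a
stand-in chosen for brevity, not the anomalous-cooccurrence test described
above, and the matrix and band names are invented.

```python
# Toy document-by-term matrix A (rows: documents, columns: terms).
terms = ["beatles", "band_x", "band_y"]
A = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 1],
    [1, 0, 1],
]

n_terms = len(terms)
# C = A'A: C[i][j] counts the documents that contain both term i and term j.
C = [[sum(row[i] * row[j] for row in A) for j in range(n_terms)]
     for i in range(n_terms)]

def top_links(i, k=1):
    """Keep only the k strongest co-occurring terms for term i (self excluded)."""
    others = sorted((j for j in range(n_terms) if j != i),
                    key=lambda j: C[i][j], reverse=True)
    return [terms[j] for j in others[:k]]

links = {terms[i]: top_links(i) for i in range(n_terms)}
```

In this toy data band_x links to beatles, but beatles keeps only its link to
band_y, so the reverse link does not survive the filtering.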

On Wed, Sep 30, 2009 at 12:42 AM, Shashikant Kore <sh...@gmail.com> wrote:

> Some time back I had thought about this idea. But, I sensed one
> potential problem with this approach. The resulting co-occurrence will
> be bi-directional. For documents this property is fine, but for terms,
> it may not be desirable in some cases.
>

-- 
Ted Dunning, CTO
DeepDyve

Re: Using mahout to cluster terms in Lucene

Posted by Shashikant Kore <sh...@gmail.com>.

Ted,

Some time back I had thought about this idea. But I sensed one
potential problem with this approach. The resulting co-occurrence will
be bi-directional. For documents this property is fine, but for terms,
it may not be desirable in some cases.

For example, if "Roger Federer" is the keyword, the co-occurring terms
will be "Tennis", "Grand slam", "Wimbledon", etc. But, for "Tennis",
the list of top co-occurring terms may not include "Roger Federer."
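
The asymmetry in this example can be made concrete with conditional
frequencies: the raw count of documents containing both terms is the same in
either direction, but dividing by each term's own document frequency is not.
This is only an illustrative sketch; the documents and the helper names
doc_freq and conditional are invented.

```python
# Hypothetical documents, each reduced to its set of terms.
docs = [
    {"tennis", "roger federer", "wimbledon"},
    {"tennis", "grand slam"},
    {"tennis", "wimbledon"},
    {"tennis", "serve"},
    {"roger federer", "tennis"},
]

def doc_freq(*wanted):
    """Number of documents containing every term in `wanted`."""
    return sum(1 for d in docs if all(t in d for t in wanted))

def conditional(term, given):
    """Fraction of the documents containing `given` that also contain `term`."""
    return doc_freq(term, given) / doc_freq(given)

p_tennis_given_federer = conditional("tennis", "roger federer")
p_federer_given_tennis = conditional("roger federer", "tennis")
```

With these made-up documents every "roger federer" document mentions
"tennis", while only two of the five "tennis" documents mention
"roger federer".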

Is there a way to identify the directional relationship among terms?

Of course, this was just a thought and no real code was written to
verify the assertion.

--shashi

On Wed, Sep 30, 2009 at 2:43 AM, Ted Dunning <te...@gmail.com> wrote:
> Another way to do this through the back door is to transpose the document
> set so that you have a list of documents for each term.  Index this and
> cluster it just as if it were normal documents and you will have a form of
> term clustering.

Re: Using mahout to cluster terms in Lucene

Posted by Ted Dunning <te...@gmail.com>.

Another way to do this through the back door is to transpose the document
set so that you have a list of documents for each term.  Index this and
cluster it just as if it were normal documents and you will have a form of
term clustering.
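
A minimal sketch of that transposition, assuming the documents are already
tokenized and held in memory.  The corpus, the document ids and the variable
names are made up, and no Lucene or Mahout calls are shown.

```python
from collections import defaultdict

# Hypothetical corpus: document id -> tokenized text.
docs = {
    "d1": ["tennis", "federer", "wimbledon"],
    "d2": ["tennis", "wimbledon"],
    "d3": ["guitar", "album"],
}

# Transpose: for each term, collect the ids of the documents it appears in.
term_docs = defaultdict(list)
for doc_id, tokens in docs.items():
    for term in sorted(set(tokens)):
        term_docs[term].append(doc_id)

# Each term becomes a pseudo-document whose "words" are document ids;
# this text could then be indexed and clustered like any other document.
pseudo_docs = {term: " ".join(ids) for term, ids in term_docs.items()}
```

Terms that appear in the same documents end up with identical or overlapping
pseudo-documents, which is what lets an ordinary document clusterer group
them.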

On Tue, Sep 29, 2009 at 1:05 PM, Grant Ingersoll <gs...@apache.org> wrote:

> The LDA implementation kind of clusters on terms to generate topics.  It
> sounds like you want some co-occurrence analysis, I'm not sure that the
> clustering algorithms are best for that, but perhaps others have insight.
>  I could imagine doing this with HBase or Pig and just keeping a matrix
> where each cell kept track of the number of times both terms appear in a
> document (or even within some window in a document).


-- 
Ted Dunning, CTO
DeepDyve

Re: Using mahout to cluster terms in Lucene

Posted by Grant Ingersoll <gs...@apache.org>.

The LDA implementation kind of clusters on terms to generate topics.
It sounds like you want some co-occurrence analysis; I'm not sure that
the clustering algorithms are best for that, but perhaps others have
insight.  I could imagine doing this with HBase or Pig and just
keeping a matrix where each cell keeps track of the number of times
both terms appear in a document (or even within some window in a
document).
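
A small in-memory sketch of such a co-occurrence matrix, with a Python
Counter standing in for the HBase or Pig store mentioned above.  The function
names, the window parameter and the sample documents are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_by_document(docs):
    """Count, for each unordered term pair, the documents containing both."""
    counts = Counter()
    for tokens in docs:
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
    return counts

def cooccurrence_by_window(tokens, window=2):
    """Count term pairs that occur within `window` positions of each other."""
    counts = Counter()
    for i, a in enumerate(tokens):
        for b in tokens[i + 1 : i + 1 + window]:
            if a != b:
                counts[tuple(sorted((a, b)))] += 1
    return counts

docs = [
    ["tennis", "federer", "wimbledon"],
    ["tennis", "wimbledon", "final"],
]
by_doc = cooccurrence_by_document(docs)
by_window = cooccurrence_by_window(docs[0], window=1)
```

Keys are sorted term pairs, so each cell of the symmetric matrix is stored
once.  A window of 1 counts only adjacent terms, which is why "tennis" and
"wimbledon" co-occur at the document level here but not in by_window.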

On Sep 29, 2009, at 8:57 AM, Ole-Martin Mørk wrote:

> Hi.
> I have been using org.apache.mahout.utils.vectors.lucene.Driver
> and org.apache.mahout.clustering.kmeans.KMeansDriver to cluster documents in
> our Lucene index and it works great! I am wondering though, is it possible
> to use Mahout to cluster terms?
>
> I want to cluster terms that often appear in the same documents.
>
> Thank you.
>
> --
> Ole-Martin Mørk
> http://twitter.com/olemartin
> http://flickr.com/olemartin

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search