Posted to user@mahout.apache.org by Isabel Drost <is...@apache.org> on 2009/09/01 08:23:05 UTC

Re: String clustering and other newbie questions

On Mon, 31 Aug 2009 14:02:08 +0200
Juan Francisco Contreras Gaitan <ju...@hotmail.com> wrote:

> Thank you very much for your answer, but I think I can't understand
> it very well. Could you give me some more details?

Taking up that question, Ted, please correct me anywhere where I'm
wrong.


> For example, what does 'DP' stand for?

DP stands for Dirichlet Process, whose clustering behaviour is often described via the "Chinese restaurant process". There is a nice Wikipedia page on Dirichlet processes themselves: http://en.wikipedia.org/wiki/Dirichlet_process

How they were employed to implement a clustering algorithm in Mahout
is explained on one of our wiki pages (including references to the
original papers):

http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html


> You can see an example of what I would like to
> do in my previous answer.

In a k-means-like setup, you would implement your own distance
(Levenshtein in your case) and use it to assign items to clusters
during the E(stimation) step. After that, you would employ your own
implementation of a centroid-selection algorithm to recompute
cluster centroids during the M(aximisation) step.
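The Levenshtein distance mentioned here is straightforward to implement with dynamic programming. A minimal standalone sketch of just the metric (not wired into Mahout's APIs):

```java
// Classic dynamic-programming Levenshtein (edit) distance.
// Two rolling rows keep memory at O(|b|) instead of O(|a|*|b|).
public class Levenshtein {
  public static int distance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j; // cost of building b's prefix from an empty string
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i; // cost of deleting a's prefix entirely
      for (int j = 1; j <= b.length(); j++) {
        int subst = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
        curr[j] = Math.min(subst, Math.min(prev[j] + 1, curr[j - 1] + 1));
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }
}
```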

In a DP-like setup it would look a little different: during the E step,
instead of having k cluster centers, computing distances to those
centers, and making hard assignments, you would have k cluster models
and compute the probability of each string being generated by each
model. During the M step you would then recompute each cluster model
based on how likely each string was found to be generated by that model.
To arrive at a final assignment, once the assignment probabilities
become stable you could assign each point to the model with the
highest probability.
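The soft E/M loop just described can be illustrated with a toy sketch. The character-frequency cluster models below are purely hypothetical stand-ins for whatever model one actually chooses, and are not how Mahout's Dirichlet implementation models clusters:

```java
import java.util.List;

// Toy soft E/M sketch: k cluster models, each a smoothed distribution
// over the letters a-z.  Log-probabilities avoid numerical underflow.
public class SoftEmSketch {

  // log P(string | model): sum of per-character log-probabilities.
  static double logLikelihood(String s, double[] charProbs) {
    double ll = 0.0;
    for (char c : s.toCharArray()) {
      double p = (c >= 'a' && c <= 'z') ? charProbs[c - 'a'] : 1e-6;
      ll += Math.log(p);
    }
    return ll;
  }

  // E step: P(model i | string), by Bayes' rule with equal priors.
  static double[] responsibilities(String s, double[][] models) {
    double[] ll = new double[models.length];
    double max = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < models.length; i++) {
      ll[i] = logLikelihood(s, models[i]);
      max = Math.max(max, ll[i]);
    }
    double[] r = new double[models.length];
    double sum = 0.0;
    for (int i = 0; i < models.length; i++) {
      r[i] = Math.exp(ll[i] - max); // shift by max for stability
      sum += r[i];
    }
    for (int i = 0; i < models.length; i++) {
      r[i] /= sum;
    }
    return r;
  }

  // M step: re-estimate each model, weighting every string's character
  // counts by how responsible that model was for the string.
  static double[][] reestimate(List<String> data, double[][] models) {
    double[][] counts = new double[models.length][26];
    for (String s : data) {
      double[] r = responsibilities(s, models);
      for (int i = 0; i < models.length; i++) {
        for (char c : s.toCharArray()) {
          if (c >= 'a' && c <= 'z') counts[i][c - 'a'] += r[i];
        }
      }
    }
    for (int i = 0; i < models.length; i++) {
      double total = 0.0;
      for (double v : counts[i]) total += v;
      for (int j = 0; j < 26; j++) {
        counts[i][j] = (counts[i][j] + 0.01) / (total + 0.26); // smoothing
      }
    }
    return counts;
  }
}
```

Iterating E and M until the responsibilities stop changing, then taking the arg-max model per string, gives the hard final assignment mentioned above.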

 
Isabel

Re: String clustering and other newbie questions

Posted by Ted Dunning <te...@gmail.com>.
The k-means implementation has the idea of distance between vectors of real
numbers pretty deeply baked into it.  One example of this is that it assumes
that you can take the average (aka centroid) of a set of examples.  Taking
the average of a set of strings in the sense of Levenstein distance would be
difficult.

There is an alternative algorithm called k-medoids which uses one of the
input samples as the cluster center, but I would expect this to give poor
results with Levenshtein distance.
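A medoid, unlike a centroid, is always one of the input samples: the member that minimizes the summed distance to the rest of its cluster. A minimal sketch of that selection step, with the distance function left pluggable (e.g. Levenshtein):

```java
import java.util.List;
import java.util.function.ToIntBiFunction;

// Medoid selection: scan every member as a candidate center and keep
// the one whose total distance to all other members is smallest.
public class Medoid {
  public static String select(List<String> cluster,
                              ToIntBiFunction<String, String> dist) {
    String best = null;
    long bestCost = Long.MAX_VALUE;
    for (String candidate : cluster) {
      long cost = 0;
      for (String other : cluster) {
        cost += dist.applyAsInt(candidate, other);
      }
      if (cost < bestCost) {
        bestCost = cost;
        best = candidate;
      }
    }
    return best;
  }
}
```

Note the O(n²) distance evaluations per cluster per iteration, which is part of why this gets expensive for edit distances.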

It would, however, be very reasonable to use bigrams or trigrams as labels on
vector coordinates.  The vector value of a string would be derived by
weighting each bigram or trigram according to the negative log of the
prevalence of that bigram or trigram in your entire corpus.  This
representation would be highly amenable to k-means clustering.  Results
should be relatively good, although inspection of the centroids is likely to
be a bit confusing.
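The representation described here can be sketched roughly as follows; the fallback weight for unseen trigrams is an arbitrary assumption, and a real pipeline would map trigram labels onto fixed vector indices:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sparse trigram vectors: each coordinate is labelled by a trigram and
// weighted by the negative log of that trigram's prevalence across the
// corpus, so rare (more discriminative) trigrams count for more.
public class TrigramVectorizer {

  static Set<String> trigrams(String s) {
    Set<String> grams = new HashSet<>();
    for (int i = 0; i + 3 <= s.length(); i++) {
      grams.add(s.substring(i, i + 3));
    }
    return grams;
  }

  // Fraction of corpus strings containing each trigram.
  static Map<String, Double> prevalence(List<String> corpus) {
    Map<String, Double> df = new HashMap<>();
    for (String s : corpus) {
      for (String g : trigrams(s)) {
        df.merge(g, 1.0, Double::sum);
      }
    }
    df.replaceAll((g, count) -> count / corpus.size());
    return df;
  }

  // Sparse vector for one string: trigram label -> -log(prevalence).
  static Map<String, Double> vectorize(String s,
                                       Map<String, Double> prevalence) {
    Map<String, Double> v = new HashMap<>();
    for (String g : trigrams(s)) {
      double p = prevalence.getOrDefault(g, 0.001); // unseen: assume rare
      v.put(g, -Math.log(p));
    }
    return v;
  }
}
```

Since these are ordinary real-valued vectors, centroids (coordinate-wise means) are well defined, which is exactly what the k-means machinery needs.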

On Tue, Sep 1, 2009 at 5:06 AM, Juan Francisco Contreras Gaitan <
juanfcocontreras@hotmail.com> wrote:

> But if I understood you well, and as far as I know, Mahout has its own
> k-means implementation. Then, could I use it for my purposes instead of a
> DP-like setup?




-- 
Ted Dunning, CTO
DeepDyve

Re: String clustering and other newbie questions

Posted by Ted Dunning <te...@gmail.com>.
And there is a close correlation between n-gram matching scores and edit
distance scores.

On Tue, Sep 1, 2009 at 11:44 AM, Sean Owen <sr...@gmail.com> wrote:

> Yeah, that probably kills the idea, doesn't it... the 'best' centroid is well
> defined this way, but searching for it may be completely unreasonable. I
> see why the counts approach doesn't have this problem.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: String clustering and other newbie questions

Posted by Sean Owen <sr...@gmail.com>.
Yeah, that probably kills the idea, doesn't it... the 'best' centroid is well
defined this way, but searching for it may be completely unreasonable. I
see why the counts approach doesn't have this problem.


Re: String clustering and other newbie questions

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Sep 1, 2009 at 9:44 AM, Sean Owen <sr...@gmail.com> wrote:

> Centroids are just strings that are a similar number of edits away from
> another set of strings.
>

Easy to say that.

Very hard to compute.  And the dimensionality is unbounded so the properties
of the centroid are not nice.  You wind up with centroids that are a large
number of edits away from everything and nearly the same distance from
everything.


> ...
>
> Anything else that doesn't map? Haven't thought about it a lot but don't
> yet
> see why k-means couldn't let you cluster strings. In the CF code I do
> something similar for arbitrary 'items' so that hints to me that a well
> behaved distance metric is all you need?
>

Depends on what you mean by well-behaved.  Mathematically speaking, string
edit measures are moderately well behaved.  Computationally and practically,
however, edit distances are not so nice.

Counts of common n-grams are much nicer since they can be interpreted as
vectors.



-- 
Ted Dunning, CTO
DeepDyve

Re: String clustering and other newbie questions

Posted by Sean Owen <sr...@gmail.com>.
If I may attempt to clarify: indeed, it makes no sense to have a vector
whose elements are 'string-valued', nor can I think of any mapping to
doubles that has any use here.

What he is really after is clustering strings as if they were vectors
themselves, not elements of another vector. The question is: how much do we
need to be able to treat strings like vectors to make the algorithm work?

We need a distance metric, and he's suggesting Levenshtein, which seems OK at
first glance. (It satisfies the triangle inequality ... I think?)

Centroids are just strings that are a similar number of edits away from
another set of strings.

Distances are discrete, does that matter though?

Anything else that doesn't map? I haven't thought about it a lot, but I don't
yet see why k-means couldn't let you cluster strings. In the CF code I do
something similar for arbitrary 'items', so that hints to me that a
well-behaved distance metric is all you need?

Of course, the code wouldn't quite work as-is to perform this. One would
probably need to modify it a lot.

For what it is worth... you could actually get the TreeClusteringRecommender
class to cluster your strings with just a little work. I am not sure it
implements the algorithm you want. It is also not distributed.

Sean


Re: String clustering and other newbie questions

Posted by Ted Dunning <te...@gmail.com>.
That particular trick wouldn't work because you are losing the essence of
real numbers with this step.  If 1.0 refers to one string and 2.0 refers to
another, what does 1.5 refer to?

Better to use trigrams as the labels for the coordinates and weight them by
inverse document frequency.

On Tue, Sep 1, 2009 at 6:28 AM, Juan Francisco Contreras Gaitan <
juanfcocontreras@hotmail.com> wrote:

>
> ... I could use a Map between doubles and strings: storing doubles throughout
> the algorithm, and retrieving the strings to compute distances in the
> measuring steps.
>

RE: String clustering and other newbie questions

Posted by Juan Francisco Contreras Gaitan <ju...@hotmail.com>.
Well, I have reread Ted's answer after having a look at some of the information Isabel gave me, and I think you are right. But I am not sure about the reason Mahout's k-means algorithm cannot be used with strings, after defining a string distance metric. Taking Jeff's advice, I could use a Map between doubles and strings: storing doubles throughout the algorithm, and retrieving the strings to compute distances in the measuring steps. Could that make any sense?

Regards,
jfcg



Re: String clustering and other newbie questions

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 1, 2009, at 5:06 AM, Juan Francisco Contreras Gaitan wrote:

>
> Ok, I see. Sorry for my lack of knowledge on these matters (I am going
> to read all the documentation you gave me closely).
>
> But if I understood you well, and as far as I know, Mahout has its
> own k-means implementation. Then, could I use it for my purposes
> instead of a DP-like setup?

I think Ted was saying that DP is the only one that would work for  
what you described, but it's also possible we aren't understanding the  
problem right either.

Obviously, one of the things we as a project need to develop more is
guidelines on which approaches work for which types of problems.

-Grant

RE: String clustering and other newbie questions

Posted by Juan Francisco Contreras Gaitan <ju...@hotmail.com>.
Ok, I see. Sorry for my lack of knowledge on these matters (I am going to read all the documentation you gave me closely).

But if I understood you well, and as far as I know, Mahout has its own k-means implementation. Then, could I use it for my purposes instead of a DP-like setup?

Thank you very much, Isabel.

Regards,
jfcg

