Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/12/31 18:10:35 UTC

Clustering techniques, tips and tricks

As some of you may know, I'm working on a book (it's a long time coming, but I'm getting there) about open source techniques for working with text.  One of my chapters is on clustering and in it, I want to talk about generic clustering approaches and then show concrete examples of them in action.   I've got the concrete side of it down.

Based on my research, it seems people typically divide up the clustering space into two approaches: hierarchical and flat/partitioning.  In overlaying that knowledge with what we have for techniques in Mahout, I'm a bit stumped about where things like LDA and Dirichlet fit into those two approaches, or is there perhaps a third that I'm missing?  They don't seem particularly hierarchical, but they don't seem flat either, if that makes any sense, given the probabilistic/mixture nature of the algorithms.  Perhaps I should forgo the traditional division that previous authors have taken and just talk about a suite of techniques at a little lower level?  Thoughts?

The other thing I'm interested in is people's real-world feedback on using clustering to solve their text-related problems.  For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)?  What algorithms worked for you?  What didn't work?  Any and all insight is welcome, and I don't particularly care if it is Mahout specific (for instance, part of the chapter is about search result clustering using Carrot2, and so Mahout isn't applicable).

Thanks in advance and Happy New Year,
Grant

Re: Clustering techniques, tips and tricks

Posted by Dawid Weiss <da...@gmail.com>.
> The carrot2 implementation is among the best that I have used in terms of
> the quality of clustering small batches of results such as search result
> lists.

(blushing). Thanks, it's a big compliment coming from you, Ted.

Re: Clustering techniques, tips and tricks

Posted by Grant Ingersoll <gs...@apache.org>.
Sounds like we need some more wiki docs. 

-Grant

On Jan 5, 2010, at 10:08 PM, Bogdan Vatkov wrote:

> 10x, that was what I needed :)
> 
> On Wed, Jan 6, 2010 at 4:58 AM, Drew Farris <dr...@gmail.com> wrote:
> 
>> Take a look at o.a.m.clustering.ClusterDumper in mahout-utils. The
>> points file is a SequenceFile<Text,Text> where the key is the vector
>> id and the value is a cluster id.
>> 
>> On Tue, Jan 5, 2010 at 9:51 PM, Bogdan Vatkov <bo...@gmail.com>
>> wrote:
>>> I have already customized the Lucene index-to-vector dumper quite a lot
>>> (e.g. applied stop-words from a file, and a stop-regex), but I am wondering
>>> how the input vectors remain reachable if I start from the cluster vectors.
>>> You say the points are somehow doing that; where can I read more, can you
>>> tell me more, or is there a piece of code which would best guide me through
>>> the points format?
>> 
> 
> 
> 
> -- 
> Best regards,
> Bogdan


Re: Clustering techniques, tips and tricks

Posted by Bogdan Vatkov <bo...@gmail.com>.
10x, that was what I needed :)

On Wed, Jan 6, 2010 at 4:58 AM, Drew Farris <dr...@gmail.com> wrote:

> Take a look at o.a.m.clustering.ClusterDumper in mahout-utils. The
> points file is a SequenceFile<Text,Text> where the key is the vector
> id and the value is a cluster id.
>
> On Tue, Jan 5, 2010 at 9:51 PM, Bogdan Vatkov <bo...@gmail.com>
> wrote:
> > I have already customized the Lucene index-to-vector dumper quite a lot
> > (e.g. applied stop-words from a file, and a stop-regex), but I am wondering
> > how the input vectors remain reachable if I start from the cluster vectors.
> > You say the points are somehow doing that; where can I read more, can you
> > tell me more, or is there a piece of code which would best guide me through
> > the points format?
>



-- 
Best regards,
Bogdan

Re: Clustering techniques, tips and tricks

Posted by Drew Farris <dr...@gmail.com>.
Take a look at o.a.m.clustering.ClusterDumper in mahout-utils. The
points file is a SequenceFile<Text,Text> where the key is the vector
id and the value is a cluster id.
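
If you want to poke at that mapping yourself, here is a minimal sketch using
the plain Hadoop SequenceFile API (the path argument is just a placeholder for
one of the files under the points output directory):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PointsFileDump {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path pointsFile = new Path(args[0]);     // e.g. a part-* file under points/
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, pointsFile, conf);
        try {
          Text key = new Text();                 // vector id (the label, if you set one)
          Text value = new Text();               // cluster id the vector was assigned to
          while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
          }
        } finally {
          reader.close();
        }
      }
    }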

On Tue, Jan 5, 2010 at 9:51 PM, Bogdan Vatkov <bo...@gmail.com> wrote:
> I have already customized the Lucene index-to-vector dumper quite a lot
> (e.g. applied stop-words from a file, and a stop-regex), but I am wondering
> how the input vectors remain reachable if I start from the cluster vectors.
> You say the points are somehow doing that; where can I read more, can you
> tell me more, or is there a piece of code which would best guide me through
> the points format?

Re: Clustering techniques, tips and tricks

Posted by Bogdan Vatkov <bo...@gmail.com>.
I have already customized the Lucene index-to-vector dumper quite a lot
(e.g. applied stop-words from a file, and a stop-regex), but I am wondering
how the input vectors remain reachable if I start from the cluster vectors.
You say the points are somehow doing that; where can I read more, can you
tell me more, or is there a piece of code which would best guide me through
the points format?

On Wed, Jan 6, 2010 at 4:43 AM, Drew Farris <dr...@gmail.com> wrote:

> Each iteration of kmeans produces a clusters-X folder, with X starting
> at 0. You would get clusters-0 in cases where the clusters converge
> after the first run.
>
> Whether your clusters will retain document id's is based on how you
> create the vectors. For example, the lucene vector dumper can be told
> to extract the value from a specific field in the index to use for the
> vector labels. These are carried through to the points file produced
> at the end of the k-means run.
>
> On Tue, Jan 5, 2010 at 9:36 PM, Bogdan Vatkov <bo...@gmail.com>
> wrote:
> > Is there some description of the content of the cluster vector?
> > I also noticed that I end up with some folders clusters-0 and clusters-1,
> > but sometimes it is only clusters-0, when do we get the different folders
> > and which should be used as end result - e.g. by the ClusterDumper?
>



-- 
Best regards,
Bogdan

Re: Clustering techniques, tips and tricks

Posted by Drew Farris <dr...@gmail.com>.
On Tue, Jan 5, 2010 at 9:43 PM, Drew Farris <dr...@gmail.com> wrote:
> Each iteration of kmeans produces a clusters-X folder, with X starting
> at 0. You would get clusters-0 in cases where the clusters converge
> after the first run.

To correct/clarify: you get clusters-0 for every run -- you would get
clusters-0 and no other output directories for runs that converged in
a single iteration. I believe you always want to use the results in
the highest-numbered clusters-X directory.

Re: Clustering techniques, tips and tricks

Posted by Drew Farris <dr...@gmail.com>.
Each iteration of kmeans produces a clusters-X folder, with X starting
at 0. You would get clusters-0 in cases where the clusters converge
after the first run.

Whether your clusters will retain document IDs is based on how you
create the vectors. For example, the lucene vector dumper can be told
to extract the value from a specific field in the index to use for the
vector labels. These are carried through to the points file produced
at the end of the k-means run.

On Tue, Jan 5, 2010 at 9:36 PM, Bogdan Vatkov <bo...@gmail.com> wrote:
> Is there some description of the content of the cluster vector?
> I also noticed that I end up with some folders clusters-0 and clusters-1,
> but sometimes it is only clusters-0, when do we get the different folders
> and which should be used as end result - e.g. by the ClusterDumper?

Re: Clustering techniques, tips and tricks

Posted by Pallavi Palleti <pa...@corp.aol.com>.
The clusters-i directories are one per iteration, and points is the folder
where you have the final output data in a consumable format. For example,
in FuzzyKMeans, the clusters-0 directory contains clusterId\tclusterVector
key/value pairs. This is what the next iteration reads to get the centroids.
The points directory, on the other hand, contains data as
itemVector\tclusterProbabilities. This gives you each item and its cluster
probabilities (p(cluster | item)).

Thanks
Pallavi

 

Bogdan Vatkov wrote:
> Is there a description of the output structure of the results? I also see
> some folders like points, which is used by the ClusterDumper, but I do not
> know the technical details.
> I would be interested in what kind of data is available as a result of the
> clustering. Is it different when a different algorithm is used (kmeans,
> canopy, dirichlet)?
>
> I also have one more theoretical question: for the cluster with the most
> "points", the third term by weight is one with word freq = 9 according to
> the Solr dictionary (and according to my knowledge of the corpora too), and
> this is over 23 000+ input docs. Is it something with the kmeans algorithm?
> The rest of the terms and clusters seem to be somehow OK, but that one
> really astonished me. I am almost sure it is not a problem with the
> (index - dictionary) mapping like I had before ;) (but that was a general
> problem then: I was using the wrong dictionary file).
> I am running with convergence 0.5; is that OK?
>
> Best regards,
> Bogdan
>
>   

Re: Clustering techniques, tips and tricks

Posted by Bogdan Vatkov <bo...@gmail.com>.
Is there a description of the output structure of the results? I also see
some folders like points, which is used by the ClusterDumper, but I do not
know the technical details.
I would be interested in what kind of data is available as a result of the
clustering. Is it different when a different algorithm is used (kmeans,
canopy, dirichlet)?

I also have one more theoretical question: for the cluster with the most
"points", the third term by weight is one with word freq = 9 according to
the Solr dictionary (and according to my knowledge of the corpora too), and
this is over 23 000+ input docs. Is it something with the kmeans algorithm?
The rest of the terms and clusters seem to be somehow OK, but that one
really astonished me. I am almost sure it is not a problem with the
(index - dictionary) mapping like I had before ;) (but that was a general
problem then: I was using the wrong dictionary file).
I am running with convergence 0.5; is that OK?

Best regards,
Bogdan

Re: Clustering techniques, tips and tricks

Posted by Bogdan Vatkov <bo...@gmail.com>.
Is there some description of the content of the cluster vector?
I also noticed that I end up with some folders clusters-0 and clusters-1,
but sometimes it is only clusters-0. When do we get the different folders,
and which should be used as the end result, e.g. by the ClusterDumper?

On Wed, Jan 6, 2010 at 4:31 AM, Ted Dunning <te...@gmail.com> wrote:

> On Tue, Jan 5, 2010 at 6:12 PM, Bogdan Vatkov <bogdan.vatkov@gmail.com
> >wrote:
>
> >
> > What do you mean by using the document ID, and that vectors can take labels?
> > Is it something I could use right away from the current cluster vectors, or
> > should I change some Mahout code to get to the document IDs?
> >
>
> This means that a vector can have a name.  That name can be the ID of your
> original document.
>
> I think, but am not sure, that you can use this now.  Grant would know
> better.
>
> --
> Ted Dunning, CTO
> DeepDyve
>



-- 
Best regards,
Bogdan

Re: Clustering techniques, tips and tricks

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Jan 5, 2010 at 6:12 PM, Bogdan Vatkov <bo...@gmail.com>wrote:

>
> What do you mean by using the document ID, and that vectors can take labels?
> Is it something I could use right away from the current cluster vectors, or
> should I change some Mahout code to get to the document IDs?
>

This means that a vector can have a name.  That name can be the ID of your
original document.

I think, but am not sure, that you can use this now.  Grant would know
better.

-- 
Ted Dunning, CTO
DeepDyve

Re: Clustering techniques, tips and tricks

Posted by Bogdan Vatkov <bo...@gmail.com>.
>>> stopwords inside) - maybe it is a question on its own - how can I easily go
>>> back from clusters->original docs (and not just vectors), I do not know maybe
>>> some kind of mapper which maps vectors to the original documents somehow
>>> (e.g. a sort of URL for a document based on the vector id/index or
>>> something?).
>>>
>>
>> To do this, you should use the document ID and just return the original
>> content from some other content store.  Lucene or especially SOLR can help
>> with this.

> Right, Mahout's vector can take labels.

What do you mean by using the document ID, and that vectors can take labels?
Is it something I could use right away from the current cluster vectors, or
should I change some Mahout code to get to the document IDs?

Re: Clustering techniques, tips and tricks

Posted by Ted Dunning <te...@gmail.com>.
Either would work.  Spilling vectors from a Lucene index is more developed
at this point in time and thus would probably be easier.

On Fri, Jan 1, 2010 at 7:45 AM, Bogdan Vatkov <bo...@gmail.com>wrote:

> Do you mean I have to apply stemming during the vector creation or already
> in Lucene indexing? Maybe from clustering POV it is the same but what would
> you recommend?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Clustering techniques, tips and tricks

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 1, 2010, at 10:45 AM, Bogdan Vatkov wrote:
> I do not know how this updatable clustering works (using previous results as
> centroids for new clusterings); is there an example I could see in action?
> Additionally, I would like to see an example of how one could combine Canopy
> and k-means; I have only seen this described in theory and could not find an
> example of it.

You should just be able to run Canopy and then use the output of that as the value for --clusters on the KMeansDriver.

Re: Clustering techniques, tips and tricks

Posted by Ted Dunning <te...@gmail.com>.
I don't think that this code exists yet.  It is possible in theory, but not
yet reduced to practice.

On Fri, Jan 1, 2010 at 7:45 AM, Bogdan Vatkov <bo...@gmail.com>wrote:

> I do not know how this updatable clustering works (using previous results as
> centroids for new clusterings); is there an example I could see in action?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Clustering techniques, tips and tricks

Posted by Ted Dunning <te...@gmail.com>.
If you start gathering data about whether a particular user would like a
document then you might also be interested in the on-line learning
algorithms that are currently being developed.  These should be able to let
the user say whether they wanted to read a document or not.  The clustering
could be used initially, but as interest information is collected, you can
build a per-user profile of interest.

On Fri, Jan 1, 2010 at 7:45 AM, Bogdan Vatkov <bo...@gmail.com>wrote:

> > Don't people usually see if the new docs fit into an existing cluster and,
> > if they are a good fit, add them there; otherwise, maybe put them in the
> > best match and kick off a new job?
> >
> Actually this question goes back to the original attempt: to analyze
> documents automatically by the machine, and not by people :). One of my
> goals is to not read the new document, but rather to have the system tell me
> whether I should read it ;) - e.g. if it gets clustered/classified against a
> given cluster/topic which I am (or am not) interested in, I could then make
> a more informed decision whether to read it (or not).




-- 
Ted Dunning, CTO
DeepDyve

Re: Clustering techniques, tips and tricks

Posted by Bogdan Vatkov <bo...@gmail.com>.
On Fri, Jan 1, 2010 at 3:24 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Jan 1, 2010, at 5:00 AM, Ted Dunning wrote:
>
> > On Thu, Dec 31, 2009 at 10:41 PM, Bogdan Vatkov <bogdan.vatkov@gmail.com
> >wrote:
> >
> >>
> >> I would like to give some feedback. And ask some questions as well :).
> >>
> >
> > Thank you!
> >
> > Very helpful feedback.
> >
> >
> >> ... Carrot2 for 2 weeks ... has great level of
> >> usability and simplicity but ...I had to give up on it since my very
> first
> >> practical clustering task required to cluster 23K+ documents.
> >
> >
> > Not too surprising.
>
> Right, Carrot2 is designed for clustering search results, and of that
> mainly the title and snippet.  While it can do larger docs, they are
> specifically not the target.  Plus, C2 is an in-memory tool designed to be
> very fast for search results.
>
>
> >
> >
> >> ...
> >> I have managed to do some clustering on my 23 000+ docs with
> Mahout/k-means
> >> for something like 10 min (in standalone mode - no parallel processing
> at
> >> all, I didn't even use all of my (3:-) ) cores yet with Hadoop/Mahout)
> but
> >> I
> >> am still learning and still trying to analyze if the result clusters are
> >> really meaningful for my docs.
> >>
> >
> > I have seen this effect before where a map-reduce program run
> sequentially
> > is much faster than an all-in-memory implementation.
> >
> >
> >> One thing I can tell already now is that I definitely, desperately, need
> >> word-stopping
> >
> >
> > You should be able to do this in the document -> vector conversion.  You
> > could also do this at the vector level by multiplying the coordinates of
> all
> > stop words by zero, but that is not as nice a solution.
>
> Right, or if you are using the Lucene extraction method, at Lucene indexing
> time.
>
>
OK, so it seems I have to use the stop-word feature of Lucene itself, right? I
just saw there is something about stop words in Lucene, but I have yet to find
out how to use that capability.


> >
> >
> >> ... But it would be valuable for me to be able
> >> to come back later to the complete context of a document (i.e. with the
> >> stopwords inside) - maybe it is a question on its own - how can I easily
> go
> >> back from clusters->original docs (an not just vectors), I do not know
> >> maybe
> >> some kind of mapper which maps vectors to the original documents somehow
> >> (e.g. sort of URL for a document based on the vector id/index or
> >> something?).
> >>
> >
> > To do this, you should use the document ID and just return the original
> > content from some other content store.  Lucene or especially SOLR can
> help
> > with this.
>
> Right, Mahout's vector can take labels.
>
> >
> >
> >> ...
> >> I think I will get better results if I can also apply stemming. What
> would
> >> be you recommendation when using mahout? Should I do the stemming again
> >> somewhere in the input vector forming?
> >
> >
> > Yes.  That is exactly correct.
>
> Again, really easy to do if you use the Lucene method for creating vectors.
>
Do you mean I have to apply stemming during the vector creation, or already at
Lucene indexing time? Maybe from a clustering POV it is the same, but what
would you recommend?


> >
> > It is also really essential for me to have "updateable" algorithms as I
> am
> >> adding new documents on daily basis, and I definitely like to have them
> >> clustered immediately (incrementally) - I do not know if this is what is
> >> called "classification" in Mahout and I did not reach these examples yet
> (I
> >> wanted to really get acquainted with the clustering first).
> >>
> >
> > I can't comment on exactly how this should be done, but we definitely
> need
> > to support this use case.
>
> Don't people usually see if the new docs fit into an existing cluster and
> if they are a good fit, add them there, otherwise, maybe put them in the
> best match and kick off a new job.
>
Actually this question goes back to the original attempt: to analyze
documents automatically by the machine, and not by people :). One of my
goals is to not read the new document, but rather to have the system tell me
whether I should read it ;) - e.g. if it gets clustered/classified against a
given cluster/topic which I am (or am not) interested in, I could then make a
more informed decision whether to read it (or not).

>
> >
> >
> >> And that is not all - I do not only want to have new documents clustered
> >> against existing clusters but what I want in addition is that clusters
> >> could
> >> actually change with new docs coming.
> >>
> >
> > Exactly.  This is easy algorithmically with k-means.  It just needs to be
> > supported by the software.
>
> Makes sense and shouldn't be that hard to do.  I'd imagine we just need to
> be able to use the centroids from the previous run as the seeds for the new
> run.
>
> >
> >
> >> Of course one could not observe new clusters popping up after a single
> new
> >> doc is added to the analysis but clusters should really be
> >> adaptable/updateable with new docs.
> >>
> >
> > Yes.  It is eminently doable.  Occasionally you should run back through
> all
> > of the document vectors so you can look at old documents in light of new
> > data but that should be very, very fast in your case.
>
>
I do not know how this updatable clustering works (using previous results as
centroids for new clusterings); is there an example I could see in action?
Additionally, I would like to see an example of how one could combine Canopy
and k-means; I have only seen this described in theory and could not find an
example of it.

Best regards,
Bogdan

Re: Clustering techniques, tips and tricks

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 1, 2010, at 5:00 AM, Ted Dunning wrote:

> On Thu, Dec 31, 2009 at 10:41 PM, Bogdan Vatkov <bo...@gmail.com>wrote:
> 
>> 
>> I would like to give some feedback. And ask some questions as well :).
>> 
> 
> Thank you!
> 
> Very helpful feedback.
> 
> 
>> ... Carrot2 for 2 weeks ... has great level of
>> usability and simplicity but ...I had to give up on it since my very first
>> practical clustering task required to cluster 23K+ documents.
> 
> 
> Not too surprising.

Right, Carrot2 is designed for clustering search results, and of that mainly the title and snippet.  While it can do larger docs, they are specifically not the target.  Plus, C2 is an in-memory tool designed to be very fast for search results.


> 
> 
>> ...
>> I have managed to do some clustering on my 23 000+ docs with Mahout/k-means
>> for something like 10 min (in standalone mode - no parallel processing at
>> all, I didn't even use all of my (3:-) ) cores yet with Hadoop/Mahout) but
>> I
>> am still learning and still trying to analyze if the result clusters are
>> really meaningful for my docs.
>> 
> 
> I have seen this effect before where a map-reduce program run sequentially
> is much faster than an all-in-memory implementation.
> 
> 
>> One thing I can tell already now is that I definitely, desperately, need
>> word-stopping
> 
> 
> You should be able to do this in the document -> vector conversion.  You
> could also do this at the vector level by multiplying the coordinates of all
> stop words by zero, but that is not as nice a solution.

Right, or if you are using the Lucene extraction method, at Lucene indexing time.


> 
> 
>> ... But it would be valuable for me to be able
>> to come back later to the complete context of a document (i.e. with the
>> stopwords inside) - maybe it is a question on its own - how can I easily go
>> back from clusters->original docs (and not just vectors), I do not know
>> maybe
>> some kind of mapper which maps vectors to the original documents somehow
>> (e.g. sort of URL for a document based on the vector id/index or
>> something?).
>> 
> 
> To do this, you should use the document ID and just return the original
> content from some other content store.  Lucene or especially SOLR can help
> with this.

Right, Mahout's vector can take labels.

> 
> 
>> ...
>> I think I will get better results if I can also apply stemming. What would
>> be your recommendation when using Mahout? Should I do the stemming again
>> somewhere in the input vector forming?
> 
> 
> Yes.  That is exactly correct.

Again, really easy to do if you use the Lucene method for creating vectors.


> 
> It is also really essential for me to have "updateable" algorithms as I am
>> adding new documents on a daily basis, and I definitely like to have them
>> clustered immediately (incrementally) - I do not know if this is what is
>> called "classification" in Mahout and I did not reach these examples yet (I
>> wanted to really get acquainted with the clustering first).
>> 
> 
> I can't comment on exactly how this should be done, but we definitely need
> to support this use case.

Don't people usually see if the new docs fit into an existing cluster and, if they are a good fit, add them there; otherwise, maybe put them in the best match and kick off a new job?


> 
> 
>> And that is not all - I do not only want to have new documents clustered
>> against existing clusters but what I want in addition is that clusters
>> could
>> actually change with new docs coming.
>> 
> 
> Exactly.  This is easy algorithmically with k-means.  It just needs to be
> supported by the software.

Makes sense and shouldn't be that hard to do.  I'd imagine we just need to be able to use the centroids from the previous run as the seeds for the new run.

> 
> 
>> Of course one could not observe new clusters popping up after a single new
>> doc is added to the analysis but clusters should really be
>> adaptable/updateable with new docs.
>> 
> 
> Yes.  It is eminently doable.  Occasionally you should run back through all
> of the document vectors so you can look at old documents in light of new
> data but that should be very, very fast in your case.


Re: Clustering techniques, tips and tricks

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Dec 31, 2009 at 10:41 PM, Bogdan Vatkov <bo...@gmail.com>wrote:

>
> I would like to give some feedback. And ask some questions as well :).
>

Thank you!

Very helpful feedback.


> ... Carrot2 for 2 weeks ... has great level of
> usability and simplicity but ...I had to give up on it since my very first
> practical clustering task required to cluster 23K+ documents.


Not too surprising.


> ...
> I have managed to do some clustering on my 23 000+ docs with Mahout/k-means
> for something like 10 min (in standalone mode - no parallel processing at
> all, I didn't even use all of my (3:-) ) cores yet with Hadoop/Mahout) but
> I
> am still learning and still trying to analyze if the result clusters are
> really meaningful for my docs.
>

I have seen this effect before where a map-reduce program run sequentially
is much faster than an all-in-memory implementation.


> One thing I can tell already now is that I definitely, desperately, need
> word-stopping


You should be able to do this in the document -> vector conversion.  You
could also do this at the vector level by multiplying the coordinates of all
stop words by zero, but that is not as nice a solution.
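
As a rough illustration of that zeroing idea, here is a plain-Java sketch over
a double[] of term weights (it assumes you kept the term -> index dictionary
used when building the vectors; as noted, removing the stop words before
vectorization is the nicer route):

    import java.util.Map;
    import java.util.Set;

    public class StopwordZeroer {
      // Zero out the coordinates of the stop-word dimensions in a term-weight
      // vector, given the dictionary that maps each term to its index.
      public static void zeroStopwords(double[] weights,
                                       Map<String, Integer> dictionary,
                                       Set<String> stopwords) {
        for (String stopword : stopwords) {
          Integer index = dictionary.get(stopword);
          if (index != null && index >= 0 && index < weights.length) {
            weights[index] = 0.0;
          }
        }
      }
    }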


> ... But it would be valuable for me to be able
> to come back later to the complete context of a document (i.e. with the
> stopwords inside) - maybe it is a question on its own - how can I easily go
> back from clusters->original docs (and not just vectors), I do not know
> maybe
> some kind of mapper which maps vectors to the original documents somehow
> (e.g. sort of URL for a document based on the vector id/index or
> something?).
>

To do this, you should use the document ID and just return the original
content from some other content store.  Lucene or especially SOLR can help
with this.


> ...
> I think I will get better results if I can also apply stemming. What would
> be your recommendation when using Mahout? Should I do the stemming again
> somewhere in the input vector forming?


Yes.  That is exactly correct.

It is also really essential for me to have "updateable" algorithms as I am
> adding new documents on a daily basis, and I definitely like to have them
> clustered immediately (incrementally) - I do not know if this is what is
> called "classification" in Mahout and I did not reach these examples yet (I
> wanted to really get acquainted with the clustering first).
>

I can't comment on exactly how this should be done, but we definitely need
to support this use case.


> And that is not all - I do not only want to have new documents clustered
> against existing clusters but what I want in addition is that clusters
> could
> actually change with new docs coming.
>

Exactly.  This is easy algorithmically with k-means.  It just needs to be
supported by the software.


> Of course one could not observe new clusters popping up after a single new
> doc is added to the analysis but clusters should really be
> adaptable/updateable with new docs.
>

Yes.  It is eminently doable.  Occasionally you should run back through all
of the document vectors so you can look at old documents in light of new
data but that should be very, very fast in your case.

Re: Clustering techniques, tips and tricks

Posted by Abbas <ab...@gmail.com>.
Hi Bogdan,

This is in reply to your previous post where you asked about stop word
removal in Mahout.

Well, recently I was fighting with the same thing and found a solution
which worked perfectly fine. What you should do is:
1. Create your own (customized) Lucene Analyzer by extending the Analyzer
class and overriding the tokenStream method (a minimal sketch follows this
list).

2. Create a jar file containing your custom analyzer. Make sure to have your
Lucene jar file in the MANIFEST.MF.

3. Place the jar in mahout/examples/target/dependency. In case you get a
ClassNotFoundException in the next step, you may like to put the two jar
files in hadoop/lib/ as well. You can also try adding the jar files to the
HADOOP_CLASSPATH and CLASSPATH environment variables.

4. Then run your seq2sparse command, specifying your custom analyzer in the
-a parameter.

5. Run your k-means command as you would otherwise do.
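
For step 1, a minimal sketch of such an analyzer (assuming a Lucene 2.9-era
API; the filter constructors moved around a bit across 2.9/3.x, and the
stop-word list here is only a placeholder):

    import java.io.Reader;
    import java.util.Set;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public class CustomStopwordAnalyzer extends Analyzer {

      // Replace with your own stop-word list, or load it from a file.
      private static final Set<?> STOP_WORDS =
          StopFilter.makeStopSet(new String[] {"a", "an", "and", "of", "the", "to"});

      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_29, reader);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(true, stream, STOP_WORDS);
        return stream;
      }
    }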

Hope this helps

If you need the complete code for custom analyzer, let me know.

Thanks
Abbas


Re: Clustering techniques, tips and tricks

Posted by Bogdan Vatkov <bo...@gmail.com>.
Hi guys,

First of all - Happy New Year - I wish you a healthy and successful 2010!

I would like to give some feedback. And ask some questions as well :).

I find the classification above very useful (I actually found some details
on the behavior of the k-means and Dirichlet algorithms that I did not see
before in the docs around Mahout).
I played around with Carrot2 for 2 weeks and it really has a great level of
usability and simplicity, but ... I had to give up on it since my very first
practical clustering task required clustering 23K+ documents. And while I was
able to run the Lingo algorithm on 5 000 docs with 6 GB of heap memory in
approx. 2 min., it "failed" when I tried to cluster 8 000 docs (it just ran
for 30 mins and I had to kill the app, since that was definitely not
practical compared to 5 000 docs clustered in 2 minutes).
That was the point where I really had to give up on Carrot2 and started
learning Mahout/Hadoop.

I have managed to do some clustering on my 23 000+ docs with Mahout/k-means
for something like 10 min (in standalone mode - no parallel processing at
all, I didn't even use all of my (3:-) ) cores yet with Hadoop/Mahout) but I
am still learning and still trying to analyze if the result clusters are
really meaningful for my docs.

One thing I can tell already now is that I definitely, desperately, need
word-stopping (which worked like a charm in Carrot2), but I am struggling a
little bit with the Mahout code to come up with the right way to apply
stopwords. As far as I know there is no such feature in the code yet. I could
use any piece of advice from you guys on where and how to apply stopwords in
the Mahout clustering algorithms. Or maybe I should do it already during the
Text-to-Vector phase? But it would be valuable for me to be able to come back
later to the complete context of a document (i.e. with the stopwords inside).
Maybe it is a question on its own: how can I easily go back from
clusters->original docs (and not just vectors)? I do not know, maybe some
kind of mapper which maps vectors to the original documents somehow (e.g. a
sort of URL for a document based on the vector id/index or something?).

So, I definitely need to apply stopwords - without them I can't make any use
of my clusters; at least the ones I got after running (one iteration of)
k-means over my original docs (Lucene-derived vectors) were not really
practical, as I ended up with lots of clusters based on words that should
have been stopped.

Another question I have (maybe I simply overlooked something in the docs):
as far as I understand, one should really run the Canopy algorithm first to
get some "good" seeds for the initial clusters and only then go for k-means
clustering, but I could not really find an example doing so. As far as I can
tell, the examples show how to run the different algorithms (Canopy, k-means)
separately, and not chained one after the other. Did I miss something? Can I
get more guidance on how to chain Canopy and k-means?
Yet another question: I read in some docs that k-means actually has to be run
multiple times somehow, but I could not find that in the examples. How can I
do that?

One thing is for sure - I cannot go back to Carrot2. It is very useful for
small volumes of docs but won't work for what I am heading for: my initial
attempt is to cluster 23K+ docs, but then I would like to move to my complete
set of 100K+ docs.

I would definitely like to be in control of the (parametrized) number of
clusters.
I think I will get better results if I can also apply stemming. What would
be your recommendation when using Mahout? Should I do the stemming again
somewhere in the input vector forming? Or do you expect this to be part of
the clustering algorithms?

It is also really essential for me to have "updateable" algorithms, as I am
adding new documents on a daily basis, and I would definitely like to have
them clustered immediately (incrementally). I do not know if this is what is
called "classification" in Mahout; I have not reached those examples yet (I
wanted to really get acquainted with the clustering first).
And that is not all: I do not only want to have new documents clustered
against existing clusters, but what I want in addition is that clusters could
actually change as new docs come in.
Of course one could not observe new clusters popping up after a single new
doc is added to the analysis, but clusters should really be
adaptable/updateable with new docs.
Would that be possible with k-means, for example?

I could give more feedback and possibly more questions :) when I get a little
bit further on this.

Best regards,
Bogdan


On Thu, Dec 31, 2009 at 10:39 PM, Ted Dunning <te...@gmail.com> wrote:

> On Thu, Dec 31, 2009 at 9:10 AM, Grant Ingersoll <gsingers@apache.org
> >wrote:
>
> > As some of you may know, I'm working on a book (it's a long time coming,
> > but I'm getting there) about open source techniques for working with
> text.
> > ...
> > Based on my research, it seems people typically divide up the clustering
> > space into two approaches: hierarchical and flat/partitioning.
>
>
> There are a number of ways of dividing the space of all clustering
> techniques.  Whether the final result is hierarchical is an important
> criterion.
>
> Other important characteristics include:
>
> - are cluster memberships hard or soft (k-means = hard, LDA and Dirichlet =
> soft)
>
> - can the clustering algorithm be viewed in a probabilistic framework
> (k-means, LDA, Dirichlet = yes, agglomerative clustering using nearest
> neighbors = not so much)
>
> - is the definition of a cluster abstract enough to be flexible with regard
> to whether a cluster is a model or does it require stronger limits.
> (k-means = symmetric Gaussian with equal variance, Dirichlet = almost any
> probabilistic model)
>
> - is the algorithm updateable or does it have to run from scratch (k-means,
> Dirichlet = yes, agglomerative clustering = not easily)
>
> - is the algorithm non-parametric (which for clustering pretty much reduces
> to whether the number and complexity of clusters can increase without bound
> as the amount of data increases, Dirichlet = yes)
>
> - does the algorithm operationally take linear time in the size of the data
> (k-means yes, LDA = not sure, Dirichlet = pretty much, agglomerative
> clustering = no for most algorithms)
>
> - can the algorithm make use of pre-existing knowledge or user adjustments?
> (k-means yes, Dirichlet yes)
>
> Note that it is pretty easy to adapt several algorithms like k-means to be
> hierarchical.
>
>
> > In overlaying that knowledge with what we have for techniques in Mahout,
> > I'm a bit stumped about where things like LDA and Dirichlet fit into
> those
> > two approaches or is there, perhaps a third that I'm missing?  They don't
> > seem particularly hierarchical but they don't seem flat either, if that
> > makes any sense, given the probabilistic/mixture nature of the
> algorithms.
> >  Perhaps I should forgo the traditional division that previous authors
> have
> > taken and just talk about a suite of techniques at a little lower level?
> >  Thoughts?
> >
>
> I think that some of these distinctions are interesting but I think it is
> also very easy to confuse newcomers with too many distinctions.  A big part
> of the confusion has to do with the fact that none of these distinctions is
> comprehensive, nor are any of these completely clear cut.
>
>
> > The other thing I'm interested in is people's real world feedback on
> using
> > clustering to solve their text related problems.  For instance, what type
> of
> > feature reduction did you do (stopword removal, stemming, etc.)?  What
> > algorithms worked for you?  What didn't work?
>
>
> My experience has been that almost any reasonable implementation of an
> algorithm in the k-means family will work reasonably well (this includes
> LDA
> and Dirichlet effectively) and none of them are completely excellent.
> Success with text clustering strongly depends on setting expectations
> correctly in the interface.  If this user expects the clustering to be
> useful but not definitive then they tend to find clustering to be high
> value.  If the user expects the clustering to be absolutely definitive,
> they
> will be sorely disappointed.  For cases where perfection is expected, it is
> often important to have good integration between clustering and human
> edits.
>
> The carrot2 implementation is among the best that I have used in terms of
> the quality of clustering small batches of results such as search result
> lists.
>



-- 
Bogdan Vatkov
email: bogdan.vatkov@gmail.com
phone: +359 889 197 756

Re: Clustering techniques, tips and tricks

Posted by Ted Dunning <te...@gmail.com>.
Clustering in the k-means style can be viewed as fitting a mixture model to
the data.  A mixture model is a model of the generation of random data where
you first pick one of n sub-models and then generate a point from the
sub-model.  There are parameters that express the probability of each
sub-model and then the parameters for each sub-model.

With k-means, you don't explicitly compute the mixing parameters and you
assume that the sub-models are symmetric Gaussians with fixed variance.

One of the simplest algorithms for this is the so-called E-M algorithm.  For
mixture distributions, this algorithm breaks into two passes.  The first
assigns data points to models, the second estimates each sub-model from the
data assigned to that model.  This algorithm is easy to implement and works
well in some cases, particularly for k-means.
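
In symbols, for the mixture of symmetric, fixed-variance Gaussians that
k-means corresponds to, the two passes look like this (a sketch; x_i are the
data points, \mu_k the sub-model means, \pi_k the mixing weights, \sigma^2
the shared variance):

    E-step (assign data points to sub-models):

        r_{ik} = \frac{\pi_k \exp(-\|x_i - \mu_k\|^2 / (2\sigma^2))}
                      {\sum_j \pi_j \exp(-\|x_i - \mu_j\|^2 / (2\sigma^2))}

    M-step (re-estimate each sub-model from the data assigned to it):

        \mu_k = \frac{\sum_i r_{ik} x_i}{\sum_i r_{ik}},
        \qquad
        \pi_k = \frac{1}{n} \sum_i r_{ik}

Plain k-means is the hard-assignment limit: each r_{ik} is forced to 1 for
the nearest centroid and 0 otherwise, and the \pi_k are never computed.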

Unfortunately, the E-M algorithm as it stands is doing what is called a
maximum likelihood fit.  If you take k-means and try to use a more elaborate
model than identical Gaussians, the use of maximum likelihood instantly
causes horrible problems because if you allow the variance to vary, the
algorithm will position some clusters on a single point and drive the
variance to zero.  The result is some point-like clusters that each describe
a single observation, and a few diffuse clusters to handle the rest of the
data.  Generalization goes out the window because the point-like clusters
won't describe new data at all.

So... back to your question.  k-means, LDA and the Dirichlet clustering are
all implementations of E-M algorithms for fitting mixture distributions (or
first cousins to avoid the max-likelihood traps).  This allows very nice
mathematical frameworks to be brought to bear on the problem of how to make
these algorithms work well.  For the second question, the distinction is
based on the fact that k-means is pretty inflexible with regard to the
sub-models because of the maximum likelihood problems.  The Dirichlet
clustering tackles this problem head-on and thus allows much more general
models.

This transition from purely distance based metaphors to probabilistic
mixture models as a fundamental metaphor for clustering is a difficult one
to make.  You can work to interpret the mixture models in terms of the
distance based metaphor, but I find that counter-productive because the
analogies used lead to some very dangerous algorithms without obvious
repair.  Viewing the problem as a statistical learning problem not only
makes the extension of the algorithms to new domains easier, but it provides
nice fixes to the problems encountered.  If you want to stick with the
distance (similarity) based metaphor, you can view the probability p(X |
m_i) of an observed data point X given the i-th model m_i as the similarity
of the data X to model m_i.  Then the estimation of the new version of the
model from a number of data points can be considered to be the analog of the
centroid computation in k-means.

On Sun, Jan 3, 2010 at 8:10 AM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Dec 31, 2009, at 3:39 PM, Ted Dunning wrote:
>
>
> > - can the clustering algorithm be viewed in a probabilistic framework
> > (k-means, LDA, Dirichlet = yes, agglomerative clustering using nearest
> > neighbors = not so much)
> >
> > - is the definition of a cluster abstract enough to be flexible with
> regard
> > to whether a cluster is a model or does it require stronger limits.
> > (k-means = symmetric Gaussian with equal variance, Dirichlet = almost any
> > probabilistic model)
>
> Can you elaborate a bit more on these two?  I can see a bit on the
> probability side, as those approaches play a factor in how similarity is
> determined, but I don't get the significance of "cluster as a model".  Is it
> just a simplification that then makes it easier to ask: does this document
> fit into the model?
>
> -Grant




-- 
Ted Dunning, CTO
DeepDyve

Re: Clustering techniques, tips and tricks

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 31, 2009, at 3:39 PM, Ted Dunning wrote:


> - can the clustering algorithm be viewed in a probabilistic framework
> (k-means, LDA, Dirichlet = yes, agglomerative clustering using nearest
> neighbors = not so much)
> 
> - is the definition of a cluster abstract enough to be flexible with regard
> to whether a cluster is a model or does it require stronger limits.
> (k-means = symmetric Gaussian with equal variance, Dirichlet = almost any
> probabilistic model)

Can you elaborate a bit more on these two?  I can see a bit on the probability side, as those approaches play a factor in how similarity is determined, but I don't get the significance of "cluster as a model".  Is it just a simplification that then makes it easier to ask: does this document fit into the model?

-Grant

Re: Clustering techniques, tips and tricks

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Dec 31, 2009 at 9:10 AM, Grant Ingersoll <gs...@apache.org>wrote:

> As some of you may know, I'm working on a book (it's a long time coming,
> but I'm getting there) about open source techniques for working with text.
> ...
> Based on my research, it seems people typically divide up the clustering
> space into two approaches: hierarchical and flat/partitioning.


There are a number of ways of dividing the space of all clustering
techniques.  Whether the final result is hierarchical is an important
criterion.

Other important characteristics include:

- are cluster memberships hard or soft (k-means = hard, LDA and Dirichlet =
soft)

- can the clustering algorithm be viewed in a probabilistic framework
(k-means, LDA, Dirichlet = yes, agglomerative clustering using nearest
neighbors = not so much)

- is the definition of a cluster abstract enough to be flexible with regard
to whether a cluster is a model or does it require stronger limits.
(k-means = symmetric Gaussian with equal variance, Dirichlet = almost any
probabilistic model)

- is the algorithm updateable or does it have to run from scratch (k-means,
Dirichlet = yes, agglomerative clustering = not easily)

- is the algorithm non-parametric (which for clustering pretty much reduces
to whether the number and complexity of clusters can increase without bound
as the amount of data increases, Dirichlet = yes)

- does the algorithm operationally take linear time in the size of the data
(k-means yes, LDA = not sure, Dirichlet = pretty much, agglomerative
clustering = no for most algorithms)

- can the algorithm make use of pre-existing knowledge or user adjustments?
(k-means yes, Dirichlet yes)

Note that it is pretty easy to adapt several algorithms like k-means to be
hierarchical.
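
For instance, one minimal way to get a hierarchy out of k-means is a
divisive, top-down scheme that recursively bisects clusters with plain
2-means. A self-contained plain-Java sketch (nothing here is Mahout code; it
is only meant to show the shape of the idea):

    import java.util.ArrayList;
    import java.util.List;

    // Recursively split a cluster in two with Lloyd-style 2-means until the
    // pieces are small enough, yielding a binary tree of clusters.
    public class DivisiveKMeansSketch {

      public static void split(List<double[]> points, int maxLeafSize,
                               List<List<double[]>> leaves) {
        if (points.size() <= maxLeafSize) {
          leaves.add(points);
          return;
        }
        // seed 2-means with the first point and the point farthest from it
        double[] c0 = points.get(0);
        double[] c1 = points.get(0);
        for (double[] p : points) {
          if (dist(p, c0) > dist(c1, c0)) {
            c1 = p;
          }
        }
        List<double[]> left = new ArrayList<double[]>();
        List<double[]> right = new ArrayList<double[]>();
        for (int iter = 0; iter < 20; iter++) {   // a fixed number of Lloyd iterations
          left.clear();
          right.clear();
          for (double[] p : points) {
            (dist(p, c0) <= dist(p, c1) ? left : right).add(p);
          }
          c0 = centroid(left, c0);
          c1 = centroid(right, c1);
        }
        if (left.isEmpty() || right.isEmpty()) {  // degenerate split: stop recursing
          leaves.add(points);
          return;
        }
        split(left, maxLeafSize, leaves);
        split(right, maxLeafSize, leaves);
      }

      private static double[] centroid(List<double[]> points, double[] fallback) {
        if (points.isEmpty()) {
          return fallback;
        }
        double[] c = new double[fallback.length];
        for (double[] p : points) {
          for (int i = 0; i < c.length; i++) {
            c[i] += p[i] / points.size();
          }
        }
        return c;
      }

      private static double dist(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
          sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return sum;                               // squared Euclidean is enough here
      }
    }

The same shape works with Mahout's k-means: re-run it on the members of any
cluster you want to split further.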


> In overlaying that knowledge with what we have for techniques in Mahout,
> I'm a bit stumped about where things like LDA and Dirichlet fit into those
> two approaches or is there, perhaps a third that I'm missing?  They don't
> seem particularly hierarchical but they don't seem flat either, if that
> makes any sense, given the probabilistic/mixture nature of the algorithms.
>  Perhaps I should forgo the traditional division that previous authors have
> taken and just talk about a suite of techniques at a little lower level?
>  Thoughts?
>

I think that some of these distinctions are interesting but I think it is
also very easy to confuse newcomers with too many distinctions.  A big part
of the confusion has to do with the fact that none of these distinctions is
comprehensive, nor are any of these completely clear cut.


> The other thing I'm interested in is people's real world feedback on using
> clustering to solve their text related problems.  For instance, what type of
> feature reduction did you do (stopword removal, stemming, etc.)?  What
> algorithms worked for you?  What didn't work?


My experience has been that almost any reasonable implementation of an
algorithm in the k-means family will work reasonably well (this includes LDA
and Dirichlet effectively) and none of them are completely excellent.
Success with text clustering strongly depends on setting expectations
correctly in the interface.  If the user expects the clustering to be
useful but not definitive then they tend to find clustering to be high
value.  If the user expects the clustering to be absolutely definitive, they
will be sorely disappointed.  For cases where perfection is expected, it is
often important to have good integration between clustering and human edits.

The carrot2 implementation is among the best that I have used in terms of
the quality of clustering small batches of results such as search result
lists.

Re: Clustering techniques, tips and tricks

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 2, 2010, at 2:15 AM, Shashikant Kore wrote:

> On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> 
>> The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems.
>> For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)?  What algorithms worked for you?
>> What didn't work?  Any and all insight is welcome and I don't particularly care if it is Mahout specific (for instance, part of
>> the chapter is about search result clustering using Carrot2 and so Mahout isn't applicable)
>> 
> 
> Let me start by saying Mahout works great for us. We can run k-means
> on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a
> single host.
> 
> Using vector normalization like L2 norm helped quite a bit. Thanks to
> Ted for this suggestion. In text clustering, you have lots of small
> documents. This results in very sparse vectors (a total of 100K
> features with each vector having 200 features.) Using vanilla TFIDF
> weights doesn't work as nicely.
> 
> Even if we don't do explicit stop word removal, the threshold values
> for document count does that in a better fashion. If you exclude the
> features which are extremely common (say more than 40% documents) or
> extremely rare (say in less than 50 documents in a corpus of 100K
> docs), you have a meaningful set of features. The current K-Means
> already accepts these parameters.

You mean the Lucene Driver that creates the vectors, right?

> 
> Stemming can be used for feature reduction, but it has a minor issue.
> When you want to find out prominent features of the resulting cluster
> centroid, the feature may not be meaningful. For example,  if a
> prominent feature is "beautiful", when you get it back, you will get
> "beauti." Ouch.

Right, but this is easily handled  via something like Lucene's highlighter functionality.  I bet it could be made to work on Mahout's vectors (+ a dictionary) fairly easily.
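
One simple alternative to the highlighter route (a sketch, not existing
Mahout functionality): while building the vectors, remember which surface
forms produced each stem, and display the most frequent surface form instead
of the raw stem:

    import java.util.HashMap;
    import java.util.Map;

    public class StemLabeler {

      // stem -> (surface form -> count)
      private final Map<String, Map<String, Integer>> surfaceForms =
          new HashMap<String, Map<String, Integer>>();

      // Call this for every (original token, stemmed token) pair seen during
      // vectorization.
      public void record(String surfaceForm, String stem) {
        Map<String, Integer> counts = surfaceForms.get(stem);
        if (counts == null) {
          counts = new HashMap<String, Integer>();
          surfaceForms.put(stem, counts);
        }
        Integer count = counts.get(surfaceForm);
        counts.put(surfaceForm, count == null ? 1 : count + 1);
      }

      // Return the most frequent surface form for a stem ("beautiful" rather
      // than "beauti"), or the stem itself if it was never recorded.
      public String label(String stem) {
        Map<String, Integer> counts = surfaceForms.get(stem);
        if (counts == null || counts.isEmpty()) {
          return stem;
        }
        String best = stem;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
          if (e.getValue() > bestCount) {
            bestCount = e.getValue();
            best = e.getKey();
          }
        }
        return best;
      }
    }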


> 
> I tried fuzzy K-Means for soft clustering, but I didn't get good
> results. May be the corpus had the issue.
> 
> One observation about the clustering process is that it is geared, by
> accident or by design, towards batch processing. There is no
> support for real-time clustering. There needs to be glue which ties
> all the components together to make the process seamless. I suppose
> someone in need of this feature will contribute it to Mahout.

Right.  This should be pretty easy to remedy, though.  One could simply use the previous results as the --clusters option, right?

> 
> Grant,  If I recall more, I will mail it to you.

Great!  Thank you.

> 
> --shashi

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Clustering techniques, tips and tricks

Posted by Pallavi Palleti <pa...@corp.aol.com>.
Sure Ted. I will do that.

Thanks
Pallavi
Ted Dunning wrote:
> Pallavi,
>
> This is very useful feedback.
>
> What you have done is very similar to the k-means++ algorithm and it is
> clearly a very good thing.
>
> There is already an issue for tracking a k-means++ implementation:
> http://issues.apache.org/jira/browse/MAHOUT-153
>
> Could you post your patch there?
>
> On Mon, Jan 4, 2010 at 4:03 AM, Palleti, Pallavi <
> pallavi.palleti@corp.aol.com> wrote:
>
>   
>> Initially, I used canopy clustering seeds as initial seeds but the results
>> weren't good and the number of clusters depends on the distance thresholds
>> we give as input. Later, I have considered randomly selecting some points
>> from the input dataset and consider them as initial seeds. Again, the
>> results were not good. Now, I have chosen initial seeds from input set in
>> such a way that the points are far from each other and I have observed
>> better clustering using Fuzzy Kmeans. I have not implemented a map-reducable
>> version for this seed selection. I will soon implement a map-reducable
>> version and submit a patch.
>>
>>     
>
>
>
>   


Re: Clustering techniques, tips and tricks

Posted by Ted Dunning <te...@gmail.com>.
Pallavi,

This is very useful feedback.

What you have done is very similar to the k-means++ algorithm and it is
clearly a very good thing.

There is already an issue for tracking a k-means++ implementation:
http://issues.apache.org/jira/browse/MAHOUT-153

Could you post your patch there?
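
For anyone following along, the core of the k-means++ seeding step is small.
A plain-Java sketch (independent of Mahout's APIs): pick the first seed
uniformly at random, then pick each further seed with probability proportional
to its squared distance from the nearest seed chosen so far.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Sketch of k-means++ seed selection (Arthur & Vassilvitskii, 2007).
    public class KMeansPlusPlusSeeds {

      public static List<double[]> chooseSeeds(List<double[]> points, int k, Random random) {
        List<double[]> seeds = new ArrayList<double[]>();
        seeds.add(points.get(random.nextInt(points.size())));   // first seed: uniform
        while (seeds.size() < k) {
          // squared distance from each point to its nearest already-chosen seed
          double[] d2 = new double[points.size()];
          double total = 0.0;
          for (int i = 0; i < points.size(); i++) {
            double best = Double.MAX_VALUE;
            for (double[] seed : seeds) {
              best = Math.min(best, squaredDistance(points.get(i), seed));
            }
            d2[i] = best;
            total += best;
          }
          // sample an index with probability proportional to d2[i]
          double r = random.nextDouble() * total;
          int next = 0;
          double cumulative = d2[0];
          while (cumulative < r && next < points.size() - 1) {
            next++;
            cumulative += d2[next];
          }
          seeds.add(points.get(next));
        }
        return seeds;
      }

      private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
          sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return sum;
      }
    }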

On Mon, Jan 4, 2010 at 4:03 AM, Palleti, Pallavi <
pallavi.palleti@corp.aol.com> wrote:

> Initially, I used canopy clustering seeds as initial seeds but the results
> weren't good and the number of clusters depends on the distance thresholds
> we give as input. Later, I have considered randomly selecting some points
> from the input dataset and consider them as initial seeds. Again, the
> results were not good. Now, I have chosen initial seeds from input set in
> such a way that the points are far from each other and I have observed
> better clustering using Fuzzy Kmeans. I have not implemented a map-reducable
> version for this seed selection. I will soon implement a map-reducable
> version and submit a patch.
>



-- 
Ted Dunning, CTO
DeepDyve

RE: Clustering techniques, tips and tricks

Posted by "Rao, Vaijanath" <va...@corp.aol.com>.
Hi Ted, 

My replies are inside

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Tuesday, January 05, 2010 12:01 PM
To: mahout-user@lucene.apache.org
Subject: Re: Clustering techniques, tips and tricks

This definition isn't quite solid.  Probably the problem is just writing
it down in English without mathematical notation.

On Mon, Jan 4, 2010 at 7:32 PM, Rao, Vaijanath
<va...@corp.aol.com>wrote:

> The idea is to have two distances one each for
> A) Across the cluster
> B) Within the cluster
>

Q? What do you mean by these, exactly?
Are these distances from a point to a cluster?  Between clusters?  The
maximum or minimum of all distances from a point to members of the
cluster?

--For the within-cluster distance:
A cluster will have some points which are near the centroid and some which
may be farther from it. The within-cluster distance is the distance between a
point and the centroid of the cluster to which the point belongs. The min
distance represents the minimum distance any point has to the centroid (i.e.
the point nearest the centroid), whereas the max distance represents the
maximum distance any point has to the centroid of the same cluster.

For the across-cluster distance:
This is the distance between the centroids of the clusters. The min distance
here represents the minimum distance between two such cluster centroids, and
the max distance represents the maximum distance between the centroids of two
clusters.

Ideally we want any clustering algorithm to have a smaller intra-cluster
distance (distance between the centroid and a point within the cluster) and
a larger inter-cluster distance (distance between two cluster centroids).

> In iteration 1 I allow infinite distance for both of these distances and run
> the regular logic of clustering. At the end of it I calculate the min
> and max for both of these distances.
>

Q? Regular logic?  Do you mean k-means?
--Regular logic here means the selection of the cluster for a data point. In
the case of k-means it is the minimum distance; for some other algorithm it
will be "within this distance or threshold" and so on. So the assignment of a
data point to a cluster remains as per the choice of clustering algorithm.



>  For the remaining iterations do the following
>

Q? To each data point?
--Yes for every data point


>
> A) Find the lowest-distance cluster for the input point. If the distance of
> the input point exceeds the within-cluster distance previously recorded for
> that cluster, then ignore it and create a new cluster with this point.
>
>

Q? What do you mean by increases the distance within the cluster?  The
sum of all distances from points in a cluster to the centroid?  Or do
you mean to
say that   the new assignment is farther than any previous member of the
cluster from the centroid?

--Here I meant that when the data point is assigned to this cluster it
increased the within-cluster (intra-cluster) distance.

Q? What are the new min and max distances for the new cluster?
--For new clusters that are created they will be set to infinite, but if the
cluster already exists the new min and max will be calculated based on the
points that are present in that cluster.

> If it does not violate the older distance, keep it with the following
> probability:
>        1. 1 if the distance of the input point is less than the min distance.
>

Q? Min over what?  for just this cluster?
--Yes; if the distance of the new point assigned to this cluster is smaller
than the prev min, then it's added as-is to the centroid, else this data point
will carry a smaller weight when added to the centroid.

>         2. in the ratio of (max-distance of the input from 
> center)/max_distance
>

Q? These two probabilities do not have the same limit for distances just
larger and just smaller than the min distance.  Do you mean that the
second should be (max-distance to center) / (max - min) ?

--I haven't played with the weighting, but your equation should give values
in the range 0-1, which defines the weight the data point carries when the
centroid is calculated. I will use this.

Q? What do you mean by "keep it"?  Keep the assignment?  What if you
don't keep
the assignment?   Do you then form a new cluster with this point?  If
you do
that, what is the min and max distance for the new cluster?

--If the data point is not assigned to any cluster then it becomes a new
cluster with min and max distance set to infinite, which means any
distance is initially allowed when later points are compared against it.

> B) If there are two clusters which have cluster distance less than the
> minimum cluster distance then merge them and calculate the within
> cluster distance and across cluster distance.
>

Q? What is "cluster distance"?  What is "minimum cluster distance" ?
How can any particular cluster distance be smaller than the smallest
cluster distance?
--What I meant was: if, after the current iteration, there are two
cluster centroids whose distance is less than the previous iteration's
minimum inter-cluster distance, then we can merge those two clusters into
one bigger cluster. At that point we need to recalculate the
inter-cluster and intra-cluster distances.
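
A sketch of the merge itself, assuming each centroid carries a count of
the points it represents (illustrative only, not the Map-Reduce version):

    // Weighted average of two centroids; the intra- and inter-cluster
    // statistics then have to be recomputed for the merged cluster.
    static double[] mergeCentroids(double[] c1, int n1, double[] c2, int n2) {
      double[] merged = new double[c1.length];
      for (int i = 0; i < c1.length; i++) {
        merged[i] = (c1[i] * n1 + c2[i] * n2) / (double) (n1 + n2);
      }
      return merged;
    }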

> I have observed that this has helped me in getting good clusters.
> Currently I have a single version code which has the above calculation
> and I am working on getting this into Map-red and should be able to
> submit a patch very soon.
>

Q? Does single version code mean sequential code?
--Yes

Q? If so, don't wait until you have a parallel version!  File a JIRA and
post what you have!

--I will definitely do that

>  Let me know If you think this will generally work.
>

Q? I think that algorithms like this heuristic have a good chance of
working. I don't see that this particular algorithm is going to handle
high dimensions very well, nor do I think it will necessarily converge
because it looks like there is always a good chance that cluster members
will be assigned to a new cluster.  In order to improve convergence, you
might consider annealing this new cluster behavior, possibly taking
1 - Math.pow(1 - ratio-as-you-defined, iteration).  This will cause the
clusters to become more and more stable.

--I don't have a good idea of how well it converges, but I will try to
add annealing, as this should help it to be more stable.
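
For reference, the annealed keep probability Ted describes would look
something like this (sketch only; ratio is the (max - d) / (max - min)
value from above):

    // As the iteration count grows this tends towards 1, so points keep
    // their assignments and fewer new clusters get spawned.
    static double annealedKeepProbability(double ratio, int iteration) {
      return 1.0 - Math.pow(1.0 - ratio, iteration);
    }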

Q? You may not be interested in a stable clustering.  The Dirichlet
clustering that we have is already a non-deterministic clustering
algorithm that produces a distribution over possible numbers of clusters
and cluster assignments.  Perhaps that is what your algorithm should be
creating.

--I need to look at this. Since I wanted to have more stable clusters, I
am not sure if I am going in the right direction to get those.

Q? I am also curious about whether your algorithm is a variant of
something like neural gas clustering.
--If we consider that the clusters grow and shrink over a period of time,
then it may be some form of neural gas, but I cannot assume that it is.
As far as I know, neural gas has many more features than what I have
written about here.

--Thanks and Regards
Vaijanath N. Rao

Re: Clustering techniques, tips and tricks

Posted by Ted Dunning <te...@gmail.com>.
This definition isn't quite solid.  Probably the problem is just writing it
down in English without mathematical notation.

On Mon, Jan 4, 2010 at 7:32 PM, Rao, Vaijanath
<va...@corp.aol.com>wrote:

> The idea is to have two distances one each for
> A) Across the cluster
> B) Within the cluster
>

What do you mean by these, exactly?

Are these distances from a point to a cluster?  Between clusters?  The
maximum or minimum of all distances from a point to members of the cluster?

> In the iter-1 I allow infinite distance for both of these distances and run
> the regular logic of clustering. At the end of it calculate the min and max
> for both of these distances.
>

Regular logic?  Do you mean k-means?

Min and max across all points against the clusters they are in?  Or against
all clusters?


>  For the remaining iterations do the following
>

To each data point?


>
> A) find the Lowest distance cluster for the input point. If the distance of
> the input point increases the distance within the cluster then the
> previously assigned then ignore it and create a new cluster with this point.
>
>

What do you mean by increases the distance within the cluster?  The sum of
all distances from points in a cluster to the centroid?  Or do you mean to
say that   the new assignment is farther than any previous member of the
cluster from the centroid?

What are the new min and max distances for the new cluster?

> If it does not violate the older distance keep it with following probability
>        1. 1 if the distance of Input point is less than min distance.
>

Min over what?  for just this cluster?


>         2. in the ratio of (max-distance of the input from
> center)/max_distance
>

These two probabilities do not have the same limit for distances just larger
and just smaller than the min distance.  Do you mean that the second should
be (max-distance to center) / (max - min) ?

What do you mean by "keep it"?  Keep the assignment?  What if you don't keep
the assignment?   Do you then form a new cluster with this point?  If you do
that, what is the min and max distance for the new cluster?

> B) If there are two clusters which have cluster distance less than the
> minimum cluster distance then merge them and calculate the within cluster
> distance and across cluster distance.
>

What is "cluster distance"?  What is "minimum cluster distance" ?  How can
any particular cluster distance be smaller than the smallest cluster
distance?


> I have observed that this has helped me in getting good clusters. Currently
> I have a single version code which has the above calculation and I am
> working on getting this into Map-red and should be able to submit a patch
> very soon.
>

Does single version code mean sequential code?

If so, don't wait until you have a parallel version!   File a JIRA and post
what you have!


>  Let me know If you think this will generally work.
>

I think that algorithms like this heuristic have a good chance of working.
I don't see that this particular algorithm is going to handle high
dimensions very well, nor do I think it will necessarily converge because it
looks like there is always a good chance that cluster members will be
assigned to a new cluster.  In order to improve convergence, you might
consider annealing this new cluster behavior, possibly taking
1 - Math.pow(1 - ratio-as-you-defined, iteration).  This will cause the
clusters to become more and more stable.

You may not be interested in a stable clustering.  The Dirichlet clustering
that we have is already a non-deterministic clustering algorithm that
produces a distribution over possible numbers of clusters and cluster
assignments.  Perhaps that is what your algorithm should be creating.

I am also curious about whether your algorithm is a variant of something
like neural gas clustering.

RE: Clustering techniques, tips and tricks

Posted by "Rao, Vaijanath" <va...@corp.aol.com>.
Hi,

I have another suggestion that I am working on; it's not yet complete, but I should be able to submit a patch in the coming weeks.

The idea is to have two distances one each for
A) Across the cluster
B) Within the cluster 

These distances help in deciding the inclusion or exclusion of an input data point in a cluster.

In the iter-1 I allow infinite distance for both of these distances and run the regular logic of clustering. At the end of it calculate the min and max for both of these distances.

For the remaining iterations do the following

A) find the Lowest distance cluster for the input point. If the distance of the input point increases the distance within the cluster then the previously assigned then ignore it and create a new cluster with this point. If it does not violate the older distance keep it with following probability
	1. 1 if the distance of Input point is less than min distance.
	2. in the ratio of (max-distance of the input from center)/max_distance

B) If there are two clusters which have cluster distance less than the minimum cluster distance then merge them and calculate the within cluster distance and across cluster distance.

I have observed that this has helped me in getting good clusters. Currently I have a single version code which has the above calculation and I am working on getting this into Map-red and should be able to submit a patch very soon.

Let me know If you think this will generally work.

--Thanks and Regards
Vaijanath 

-----Original Message-----
From: Palleti, Pallavi [mailto:pallavi.palleti@corp.aol.com] 
Sent: Monday, January 04, 2010 5:34 PM
To: mahout-user@lucene.apache.org
Subject: RE: Clustering techniques, tips and tricks

Here are my two cents.

I experimented with both k-means and fuzzy k-means on data similar to MovieLens, and what I observed is that the initial cluster seeds matter a great deal. Initially, I used canopy clustering seeds as initial seeds, but the results weren't good, and the number of clusters depends on the distance thresholds we give as input. Later, I tried randomly selecting some points from the input dataset as initial seeds. Again, the results were not good. Now, I have chosen initial seeds from the input set in such a way that the points are far from each other, and I have observed better clustering using fuzzy k-means. I have not implemented a map-reduce version of this seed selection. I will soon implement a map-reduce version and submit a patch.
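
One sequential way to do that kind of far-apart seed selection is a
farthest-first traversal; the sketch below is only an illustration of the
idea (hypothetical names, not the patch itself, and Euclidean distance is
just a placeholder for whatever measure is used):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Pick the first seed at random, then repeatedly add the point that is
    // farthest from its nearest already-chosen seed.
    static List<double[]> farthestFirstSeeds(List<double[]> data, int k, Random rnd) {
      List<double[]> seeds = new ArrayList<double[]>();
      seeds.add(data.get(rnd.nextInt(data.size())));
      while (seeds.size() < k) {
        double[] farthest = null;
        double farthestDist = -1.0;
        for (double[] p : data) {
          double nearest = Double.POSITIVE_INFINITY;
          for (double[] s : seeds) {
            nearest = Math.min(nearest, distance(p, s));
          }
          if (nearest > farthestDist) {
            farthestDist = nearest;
            farthest = p;
          }
        }
        seeds.add(farthest);
      }
      return seeds;
    }

    // Euclidean distance as a placeholder.
    static double distance(double[] a, double[] b) {
      double sum = 0.0;
      for (int i = 0; i < a.length; i++) {
        double diff = a[i] - b[i];
        sum += diff * diff;
      }
      return Math.sqrt(sum);
    }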

Also, regarding canopy, I found that the way canopy is actually defined in theory is different from what we have in Mahout. However, someone can clarify that.

Thanks
Pallavi
 
-----Original Message-----
From: Shashikant Kore [mailto:shashikant@gmail.com]
Sent: Saturday, January 02, 2010 12:45 PM
To: mahout-user@lucene.apache.org
Subject: Re: Clustering techniques, tips and tricks

On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems.
> For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)?  What algorithms worked for you?
> What didn't work?  Any and all insight is welcome and I don't 
> particularly care if it is Mahout specific (for instance, part of the 
> chapter is about search result clustering using Carrot2 and so Mahout 
> isn't applicable)
>

Let me start by saying Mahout works great for us. We can run k-means on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a single host.

Using vector normalization like L2 norm helped quite a bit. Thanks to Ted for this suggestion. In text clustering, you have lots of small documents. This results in very sparse vectors (total of 100K features with each vector having 200 features.) Using vanilla TFIDF weights doesn't work as nicely.
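
On a sparse term-weight map, L2 normalization is just a scaling of each
document vector to unit length; a sketch in plain Java (not the Mahout
Vector API):

    import java.util.Map;

    // Divide every weight by the vector's Euclidean (L2) length so that
    // long and short documents become comparable.
    static void l2Normalize(Map<Integer, Double> termWeights) {
      double sumOfSquares = 0.0;
      for (double w : termWeights.values()) {
        sumOfSquares += w * w;
      }
      double norm = Math.sqrt(sumOfSquares);
      if (norm == 0.0) {
        return;
      }
      for (Map.Entry<Integer, Double> e : termWeights.entrySet()) {
        e.setValue(e.getValue() / norm);
      }
    }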

Even if we don't do explicit stop word removal, the threshold values for document count do that in a better fashion. If you exclude the features which are extremely common (say more than 40% documents) or extremely rare (say in less than 50 documents in a corpus of 100K docs), you have a meaningful set of features. The current K-Means already accepts these parameters.
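
The pruning itself is easy to sketch; the helper below is hypothetical
(the real thresholds are simply parameters to the existing code), with
the 40% / 50-document numbers taken from the example above:

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Keep a term only if it appears in at least minDf documents and in at
    // most maxDfFraction of the corpus, e.g. keepFeatures(df, 100000, 50, 0.4).
    static Set<String> keepFeatures(Map<String, Integer> docFreq, int numDocs,
                                    int minDf, double maxDfFraction) {
      Set<String> kept = new HashSet<String>();
      for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
        int df = e.getValue();
        if (df >= minDf && df <= numDocs * maxDfFraction) {
          kept.add(e.getKey());
        }
      }
      return kept;
    }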

Stemming can be used for feature reduction, but it has a minor issue.
When you want to find out prominent features of the resulting cluster centroid, the feature may not be meaningful. For example,  if a prominent feature is "beautiful", when you get it back, you will get "beauti." Ouch.
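
One way around the "beauti" problem is to remember, for each stem, the
most frequent surface form seen during analysis and use that when
labelling centroids; a sketch (stem() stands for whatever stemmer is in
use, e.g. a Porter stemmer):

    import java.util.HashMap;
    import java.util.Map;

    // Count surface forms per stem while tokenizing, then label clusters
    // with the most frequent original word instead of the stem itself.
    static Map<String, String> bestSurfaceForms(Iterable<String> tokens) {
      Map<String, Map<String, Integer>> counts =
          new HashMap<String, Map<String, Integer>>();
      for (String token : tokens) {
        String stem = stem(token);               // hypothetical stemmer call
        Map<String, Integer> forms = counts.get(stem);
        if (forms == null) {
          forms = new HashMap<String, Integer>();
          counts.put(stem, forms);
        }
        Integer c = forms.get(token);
        forms.put(token, c == null ? 1 : c + 1);
      }
      Map<String, String> best = new HashMap<String, String>();
      for (Map.Entry<String, Map<String, Integer>> stemEntry : counts.entrySet()) {
        String bestForm = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> form : stemEntry.getValue().entrySet()) {
          if (form.getValue() > bestCount) {
            bestCount = form.getValue();
            bestForm = form.getKey();
          }
        }
        best.put(stemEntry.getKey(), bestForm);
      }
      return best;
    }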

I tried fuzzy K-Means for soft clustering, but I didn't get good results. Maybe the corpus had the issue.

One observation about the clustering process is that it is geared, by accident or by design, towards batch processing. There is no support for real-time clustering. There needs to be glue which ties all the components together to make the process seamless. I suppose, someone in need of this feature will contribute it to Mahout.

Grant,  If I recall more, I will mail it to you.

--shashi

RE: Clustering techniques, tips and tricks

Posted by "Palleti, Pallavi" <pa...@corp.aol.com>.
Here are my two cents.

I experimented with both k-means and fuzzy k-means on data similar to MovieLens, and what I observed is that the initial cluster seeds matter a great deal. Initially, I used canopy clustering seeds as initial seeds, but the results weren't good, and the number of clusters depends on the distance thresholds we give as input. Later, I tried randomly selecting some points from the input dataset as initial seeds. Again, the results were not good. Now, I have chosen initial seeds from the input set in such a way that the points are far from each other, and I have observed better clustering using fuzzy k-means. I have not implemented a map-reduce version of this seed selection. I will soon implement a map-reduce version and submit a patch.

Also, regarding canopy, I found that the way canopy is actually defined in theory is different from what we have in Mahout. However, someone can clarify that.

Thanks
Pallavi
 
-----Original Message-----
From: Shashikant Kore [mailto:shashikant@gmail.com] 
Sent: Saturday, January 02, 2010 12:45 PM
To: mahout-user@lucene.apache.org
Subject: Re: Clustering techniques, tips and tricks

On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems.
> For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)?  What algorithms worked for you?
> What didn't work?  Any and all insight is welcome and I don't 
> particularly care if it is Mahout specific (for instance, part of the 
> chapter is about search result clustering using Carrot2 and so Mahout 
> isn't applicable)
>

Let me start by saying Mahout works great for us. We can run k-means on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a single host.

Using vector normalization like L2 norm helped quite a bit. Thanks to Ted for this suggestion. In text clustering, you have lots of small documents. This results in very sparse vectors (total of 100K features with each vector having 200 features.) Using vanilla TFIDF weights doesn't work as nicely.

Even if we don't do explicit stop word removal, the threshold values for document count do that in a better fashion. If you exclude the features which are extremely common (say more than 40% documents) or extremely rare (say in less than 50 documents in a corpus of 100K docs), you have a meaningful set of features. The current K-Means already accepts these parameters.

Stemming can be used for feature reduction, but it has a minor issue.
When you want to find out prominent features of the resulting cluster centroid, the feature may not be meaningful. For example,  if a prominent feature is "beautiful", when you get it back, you will get "beauti." Ouch.

I tried fuzzy K-Means for soft clustering, but I didn't get good results. Maybe the corpus had the issue.

One observation about the clustering process is that it is geared, by accident or by design, towards batch processing. There is no support for real-time clustering. There needs to be glue which ties all the components together to make the process seamless. I suppose, someone in need of this feature will contribute it to Mahout.

Grant,  If I recall more, I will mail it to you.

--shashi

Re: Clustering techniques, tips and tricks

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 2, 2010, at 2:15 AM, Shashikant Kore wrote:

> On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> 
>> The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems.
>> For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)?  What algorithms worked for you?
>> What didn't work?  Any and all insight is welcome and I don't particularly care if it is Mahout specific (for instance, part of
>> the chapter is about search result clustering using Carrot2 and so Mahout isn't applicable)
>> 
> 
> 
> Using vector normalization like L2 norm helped quite a bit. 

As I recall, it is important that the choice of norms aligns with the choice of distance measures, as well as data source (http://www.lucidimagination.com/search/document/34ffc2a83a71a055/centroid_calculations_with_sparse_vectors and http://www.lucidimagination.com/search/document/34ffc2a83a71a055/centroid_calculations_with_sparse_vectors#3d8310376b6cdf6b)
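
One concrete case of that alignment (a sketch, assuming the vectors have
already been L2-normalized to unit length): for unit vectors the squared
Euclidean distance and the dot product determine each other, so
Euclidean-based k-means on such vectors behaves much like cosine-based
clustering.

    // For unit-length a and b: ||a - b||^2 = 2 * (1 - a.b), so ranking
    // neighbours by Euclidean distance matches ranking by cosine similarity.
    static void checkNormDistanceAlignment(double[] a, double[] b) {
      double dot = 0.0, squaredDist = 0.0;
      for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        double diff = a[i] - b[i];
        squaredDist += diff * diff;
      }
      System.out.println(squaredDist + " should equal " + 2.0 * (1.0 - dot));
    }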

Re: Clustering techniques, tips and tricks

Posted by Ian Holsman <li...@holsman.net>.
On 1/2/10 6:15 PM, Shashikant Kore wrote:
> On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll<gs...@apache.org>  wrote:
>    
>> The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems.
>> For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)?  What algorithms worked for you?
>> What didn't work?  Any and all insight is welcome and I don't particularly care if it is Mahout specific (for instance, part of
>> the chapter is about search result clustering using Carrot2 and so Mahout isn't applicable)
>>
>>      
> Let me start by saying Mahout works great for us. We can run k-means
> on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a
> single host.
>
> Using vector normalization like L2 norm helped quite a bit. Thanks to
> Ted for this suggestion. In text clustering, you have lots of small
> documents. This results in very sparse vectors (total of 100K
> features with each vector having 200 features.) Using vanilla TFIDF
> weights doesn't work as nicely.
>    

I'm not sure what L2 norm is, but wouldn't the frequent pattern mining 
feature help here?
(from mahout-157) I was hoping to use it for feature reduction.

> Even if we don't do explicit stop word removal, the threshold values
> for document count do that in a better fashion. If you exclude the
> features which are extremely common (say more than 40% documents) or
> extremely rare (say in less than 50 documents in a corpus of 100K
> docs), you have a meaningful set of features. The current K-Means
> already accepts these parameters.
>
> Stemming can be used for feature reduction, but it has a minor issue.
> When you want to find out prominent features of the resulting cluster
> centroid, the feature may not be meaningful. For example,  if a
> prominent feature is "beautiful", when you get it back, you will get
> "beauti." Ouch.
>
> I tried fuzzy K-Means for soft clustering, but I didn't get good
> results. Maybe the corpus had the issue.
>
> One observation about the clustering process is that it is geared, by
> accident or by design, towards batch processing. There is no
> support for real-time clustering. There needs to be glue which ties
> all the components together to make the process seamless. I suppose,
> someone in need of this feature will contribute it to Mahout.
>
> Grant,  If I recall more, I will mail it to you.
>
> --shashi
>
>    


Re: Clustering techniques, tips and tricks

Posted by Shashikant Kore <sh...@gmail.com>.
On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems.
> For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)?  What algorithms worked for you?
> What didn't work?  Any and all insight is welcome and I don't particularly care if it is Mahout specific (for instance, part of
> the chapter is about search result clustering using Carrot2 and so Mahout isn't applicable)
>

Let me start by saying Mahout works great for us. We can run k-means
on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a
single host.

Using vector normalization like L2 norm helped quite a bit. Thanks to
Ted for this suggestion. In text clustering, you have lots of small
documents. This results in very sparse vectors (total of 100K
features with each vector having 200 features.) Using vanilla TFIDF
weights doesn't work as nicely.

Even if we don't do explicit stop word removal, the threshold values
for document count do that in a better fashion. If you exclude the
features which are extremely common (say more than 40% documents) or
extremely rare (say in less than 50 documents in a corpus of 100K
docs), you have a meaningful set of features. The current K-Means
already accepts these parameters.

Stemming can be used for feature reduction, but it has a minor issue.
When you want to find out prominent features of the resulting cluster
centroid, the feature may not be meaningful. For example,  if a
prominent feature is "beautiful", when you get it back, you will get
"beauti." Ouch.

I tried fuzzy K-Means for soft clustering, but I didn't get good
results. Maybe the corpus had the issue.

One observation about the clustering process is that it is geared, by
accident or by design, towards batch processing. There is no
support for real-time clustering. There needs to be glue which ties
all the components together to make the process seamless. I suppose,
someone in need of this feature will contribute it to Mahout.

Grant,  If I recall more, I will mail it to you.

--shashi