Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/01/03 20:04:47 UTC

Re: Methods for Naming Clusters

Resuscitating this again...

So, I committed MAHOUT-163 (thanks, Shashi!) which implements Ted's log likelihood ideas and I've been trying it out and also comparing it to what Carrot2 does for generating labels.  One of the things that I think would make sense is to extend MAHOUT-163 to have the option to return phrases instead of just terms.  My first thought is to just create an n-gram model of the same field I'm clustering on (as that will allow the existing code to work unmodified), but I wanted to hear what others think.  Is it worth the time?
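
To make the n-gram option concrete, the idea is just to shingle the tokenized
field so that phrases look like ordinary terms to the existing scorer. A rough
Python sketch, purely illustrative (in practice this would be a shingle filter
in the analysis chain, not this toy):

    def shingles(tokens, n=2):
        # join adjacent tokens into n-gram pseudo-terms so the existing
        # term-level label scoring can rank phrases without modification
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "naive bayes text classifier".split()
    field_terms = tokens + shingles(tokens, 2)
    # ['naive', 'bayes', 'text', 'classifier',
    #  'naive bayes', 'bayes text', 'text classifier']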

I'm also interested in other approaches people have taken.

-Grant

On Sep 5, 2009, at 4:58 PM, Sebastien Bratieres wrote:

> Hi,
> 
> (I know this is an old topic -- but I am resuscitating it on purpose!)
> 
> I've come across this article (Lafferty & Blei 2009)
> http://www.citeulike.org/user/maximzhao/article/5084329 which seems to build
> upon Ted's log likelihood ratio. The goal is exactly the original poster's
> question: how to characterize a topic cluster with its terms.
> Ted, I'd be interested in knowing your opinion on this article; most
> importantly, how easily it can be implemented and what improvement it brings
> over LLR.
> 
> I hope this can help people on the list who are busy with topic clustering!
> 
> Sebastien
> 
> 
> 2009/8/12 Shashikant Kore <sh...@gmail.com>
> 
>> I was referring to the condition where a phrase is identified as good
>> by LLR and is also a prominent feature of the centroid.  But, as you
>> clarified, the LLR score alone is a good indicator for top labels.
>> 
>> Thanks for the pointer to co-occurrence statistics. I will study some
>> literature on that.
>> 
>> --shashi
>> 
>> On Wed, Aug 12, 2009 at 11:23 PM, Ted Dunning<te...@gmail.com>
>> wrote:
>>> On Wed, Aug 12, 2009 at 6:12 AM, Shashikant Kore <shashikant@gmail.com
>>> wrote:
>>> 
>>>> 
>>>> Is this a necessary & sufficient  condition for a good cluster label?
>>> 
>>> 
>>> I am not entirely clear what "this" is.  My assertion is that high LLR
>> score
>>> is sufficient evidence to use the term or phrase.  I generally also limit
>>> the number of terms as well, taking only the highest scoring ones.  The
>>> necessary and sufficient phrase comes from a rigorous mathematical
>>> background that doesn't entirely apply here where we are talking about
>>> heuristics like this.
>>> 
>>> 
>>>> On a different note, is there any way to identify relationships among
>>>> the top labels of the clusters? For example, if I have a cluster related
>>>> to automobiles, I may get the companies (GM, Ford, Toyota) along with
>>>> their popular models (Corolla, Cadillac) as top labels. How can I
>>>> figure out that Toyota and Corolla are strongly related?
>>> 
>>> 
>>> Look at the co-occurrence statistics of the terms themselves.  Use that
>> to
>>> form a sparse graph.  Then do spectral clustering or agglomerative
>>> clustering on the graph.
>>> 
>>> That will give you clusters of terms that will give you much of what you
>>> seek.  Of course, the fact that the terms are being used to describe the
>>> same cluster means that you have a good chance of just replicating the
>> label
>>> sets for your clusters.
>>> 
>>> --
>>> Ted Dunning, CTO
>>> DeepDyve
>>> 
>> 
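
To make Ted's co-occurrence suggestion in the quoted thread above a bit more
concrete, here is a rough sketch of the term-graph idea (Python, purely
illustrative: it uses Jaccard overlap as the edge weight and SciPy's
agglomerative clustering as stand-ins, where LLR-weighted co-occurrence and
spectral clustering would be natural substitutes):

    import numpy as np
    from itertools import combinations
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def group_labels(label_terms, docs, cut=0.8):
        # label_terms: the cluster's top labels; docs: iterable of token sets
        idx = {t: i for i, t in enumerate(label_terms)}
        n = len(label_terms)
        co = np.zeros((n, n))      # pairwise co-occurrence counts
        occ = np.zeros(n)          # per-term document frequency
        for d in docs:
            present = [idx[t] for t in label_terms if t in d]
            for i in present:
                occ[i] += 1
            for i, j in combinations(present, 2):
                co[i, j] += 1
                co[j, i] += 1
        sim = co / (occ[:, None] + occ[None, :] - co + 1e-9)   # Jaccard
        dist = 1.0 - sim
        np.fill_diagonal(dist, 0.0)
        z = linkage(squareform(dist, checks=False), method="average")
        return fcluster(z, t=cut, criterion="distance")

With something like this, "Toyota" and "Corolla" should land in the same group
as long as they co-occur in enough of the cluster's documents.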



Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
Not really.

Chapter 7 of my dissertation compared LLR-based relevance feedback
(unprojected except by common membership in the relevant set) versus an
unprojected ML method (inQuery) versus a one-step learning method.  That is
the closest I know of and it isn't what you are looking for.


On Mon, Jan 4, 2010 at 3:37 PM, Jake Mannix <ja...@gmail.com> wrote:

>
> Do you know of any nice relevance comparisons of un-projected vs. one-step
> learning vs. random projection vs. svd?  For text or recommenders or
> anything
> else?




-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Jake Mannix <ja...@gmail.com>.
Fair enough.

I bet that, like many other situations in this field, it's highly
data-dependent.  For N x M matrices with d being the mean number of nonempty
entries per row: if d is fairly small and N and M grow to be very, very large,
"densification" to take higher-order correlations into account becomes more
necessary.  For text, d may be small, but due to Zipf, there's that meaty set
of columns which has a pretty nice density (not stop-word density, but way
denser than the (N/M)*d mean density), which might provide enough support to
do natural smoothing at one step of learning.

For web graphs and most social graphs, getting data beyond 2-degrees is
pretty critical for all of the lightly connected nodes (they don't see much
before
3 degrees).

Do you know of any nice relevance comparisons of un-projected vs. one-step
learning vs. random projection vs. svd?  For text or recommenders or
anything
else?

  -jake

On Mon, Jan 4, 2010 at 3:22 PM, Ted Dunning <te...@gmail.com> wrote:

> I agree with everything you say.
>
> Except,
>
> I have found that A'AR gives an awful lot of what you need and may even be
> better in some ways than a full SVD.
>
> The *assumption* that incorporating all n-th degree connections is better
> in
> terms of results is just that.  Whether it actually is better is a matter
> of
> conjecture and I certainly don't have really strong evidence either way.
> The random indexing community claims to have really good results.  The LSA
> community claims some good results, notably Netflix results.  My own
> experience in document classification is that the results with A'AR (what
> we
> used to call one-step learning) and SVD are really, really similar.
>
> On Mon, Jan 4, 2010 at 2:55 PM, Jake Mannix <ja...@gmail.com> wrote:
>
> > When you notice that for text, ngrams like "software engineer" are now
> > considerably closer to "c++ developer" than to other ngrams, this gives
> you
> > information.  You don't get that information from a random projection.
> > You'll get some of that information from A'AR, because you get
> second-order
> > correlations, but then you're still losing all correlations beyond
> > second-order (and a true eigenvector is getting you the full infinite
> > series of correlations,
> > properly weighted).
> >
> > I mean, I guess you can use SVD purely for dimensional reduction, but
> like
> > you say, reduction can be done in lots of other, more efficient ways.
> >  Doing it with reduction which enhances co-occurrence relationships and
> > distorts
> > the metric to produce better clusters than when you started is something
> > that SVD, NMF, and LDA were designed for.
> >
> > Maybe I'm missing your point?
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
I agree with everything you say.

Except,

I have found that A'AR gives an awful lot of what you need and may even be
better in some ways than a full SVD.

The *assumption* that incorporating all n-th degree connections is better in
terms of results is just that.  Whether it actually is better is a matter of
conjecture and I certainly don't have really strong evidence either way.
The random indexing community claims to have really good results.  The LSA
community claims some good results, notably Netflix results.  My own
experience in document classification is that the results with A'AR (what we
used to call one-step learning) and SVD are really, really similar.
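
For anyone following along, here is a tiny numpy sketch of what A'AR means in
this discussion (toy dense data, purely illustrative; nothing here is Mahout
code, and in practice A would be sparse):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((1000, 500))    # docs x terms (tf-idf weights in practice)
    k = 50

    # one-step learning: a random projection of the term co-occurrence
    # structure A'A, with no eigen-decomposition at all
    R = rng.standard_normal((A.shape[1], k))
    term_vectors_onestep = A.T @ (A @ R)        # terms x k

    # classic LSA term coordinates, for comparison
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    term_vectors_svd = Vt[:k].T * s[:k]         # terms x k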

On Mon, Jan 4, 2010 at 2:55 PM, Jake Mannix <ja...@gmail.com> wrote:

> When you notice that for text, ngrams like "software engineer" are now
> considerably closer to "c++ developer" than to other ngrams, this gives you
> information.  You don't get that information from a random projection.
> You'll get some of that information from A'AR, because you get second-order
> correlations, but then you're still losing all correlations beyond
> second-order (and a true eigenvector is getting you the full infinite
> series of correlations,
> properly weighted).
>
> I mean, I guess you can use SVD purely for dimensional reduction, but like
> you say, reduction can be done in lots of other, more efficient ways.
>  Doing it with reduction which enhances co-occurrence relationships and
> distorts
> the metric to produce better clusters than when you started is something
> that SVD, NMF, and LDA were designed for.
>
> Maybe I'm missing your point?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Jake Mannix <ja...@gmail.com>.
Hmm... the degree to which I've found SVD useful is primarily contained in
the amount to which the metric is *not* preserved, in my experience... that's
the whole point, or else you get very little out of it: you trade a
high-dimensional sparse computation for a low-dimensional dense one, and if
you exactly preserved the metric you basically get nothing.

When you notice that for text, ngrams like "software engineer" are now
considerably closer to "c++ developer" than to other ngrams, this gives you
information.  You don't get that information from a random projection.  You'll
get some of that information from A'AR, because you get second-order
correlations, but then you're still losing all correlations beyond
second-order (and a true eigenvector is getting you the full infinite series
of correlations, properly weighted).

I mean, I guess you can use SVD purely for dimensional reduction, but like
you say, reduction can be done in lots of other, more efficient ways.  Doing
it with a reduction which enhances co-occurrence relationships and distorts
the metric to produce better clusters than when you started is something that
SVD, NMF, and LDA were designed for.

Maybe I'm missing your point?

  -jake

On Mon, Jan 4, 2010 at 2:44 PM, Ted Dunning <te...@gmail.com> wrote:

> SVD is (approximately) metric-preserving while also dimensionality
> reducing.  If you use A'AR instead of the actual term eigenvectors you
> should get similar results.
>
> On Mon, Jan 4, 2010 at 2:21 PM, Jake Mannix <ja...@gmail.com> wrote:
>
> > Ted, how would just doing a random projection do the right thing?  It's a
> > basically metric-preserving technique, and one of the primary reasons to
> > *do* LSA is to use a *different* metric (one in which "similar" terms are
> > nearer to each other than would be otherwise imagined).
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
SVD is (approximately) metric-preserving while also dimensionality
reducing.  If you use A'AR instead of the actual term eigenvectors you
should get similar results.

On Mon, Jan 4, 2010 at 2:21 PM, Jake Mannix <ja...@gmail.com> wrote:

> Ted, how would just doing a random projection do the right thing?  It's a
> basically metric-preserving technique, and one of the primary reasons to
> *do* LSA is to use a *different* metric (one in which "similar" terms are
> nearer to each other than would be otherwise imagined).
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Jake Mannix <ja...@gmail.com>.
Ted, how would just doing a random projection do the right thing?  It's a
basically metric-preserving technique, and one of the primary reasons to
*do* LSA is to use a *different* metric (one in which "similar" terms are
nearer to each other than would be otherwise imagined).

I've always thought that the primary point of that survey article was to
demonstrate how to speed up LSA, by turning the rank-K decomposition of a
sparse N x M matrix into the rank-K decomposition of a dense
(k+\delta)x(k+\delta) matrix (for \delta of order smaller than k) after doing
a single KxN matrix multiplication (well, "after" in a parallel sense).

You still want to do the decomposition, because this provides the proper
weighting for your dimensions.
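
For concreteness, here is a short numpy sketch of that recipe as I read the
survey (not a reference implementation; the oversampling amount and the single
power step are arbitrary choices):

    import numpy as np

    def randomized_svd(A, k, oversample=10, power_iters=1, seed=0):
        rng = np.random.default_rng(seed)
        omega = rng.standard_normal((A.shape[1], k + oversample))
        Y = A @ omega                   # one pass over the big matrix
        for _ in range(power_iters):    # optional power step sharpens the spectrum
            Y = A @ (A.T @ Y)
        Q, _ = np.linalg.qr(Y)          # orthonormal basis for the range of A
        B = Q.T @ A                     # small dense (k+oversample) x M matrix
        Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
        return (Q @ Ub)[:, :k], s[:k], Vt[:k]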

  -jake

On Mon, Jan 4, 2010 at 2:12 PM, Ted Dunning <te...@gmail.com> wrote:

> Btw... relative to the cost of decomposition, have you seen the recent
> spate
> of articles on stochastic decomposition?  It can dramatically speed up LSA.
>
> See http://arxiv.org/abs/0909.4061v1 for a good survey.  My guess is that
> you don't even need to do the SVD and could just use a random projection
> with a single power step (which is nearly equivalent to random indexing).
>
> On Mon, Jan 4, 2010 at 11:57 AM, Dawid Weiss <da...@gmail.com>
> wrote:
>
> > We agree, it was just me explaining things vaguely. The bottom line
> > is: a lot depends on what you're planning to do with the clusters and
> > the methodology should be suitable to this.
> >
> > Dawid
> >
> >
> > On Mon, Jan 4, 2010 at 8:53 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > > I think I agree with this for clusters that are intended for human
> > > consumption, but I am sure that I disagree with this if you are looking
> > to
> > > use the clusters internally for machine learning purposes.
> > >
> > > The basic idea for the latter is that the distances to a bunch of
> > clusters
> > > can be used as a description of a point.  This description in terms of
> > > distances to cluster centroids can make some machine learning tasks
> > vastly
> > > easier.
> > >
> > > On Mon, Jan 4, 2010 at 11:44 AM, Dawid Weiss <da...@gmail.com>
> > wrote:
> > >
> > >> What's worse -- neither method is "better". We at Carrot2 have a
> > >> strong feeling that clusters should be described properly in order to
> > >> be useful, but one may argue that in many, many applications of
> > >> clustering, the labels are _not_ important and just individual
> > >> features of clusters (like keywords or even documents themselves) are
> > >> enough.
> > >>
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> > >
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Methods for Naming Clusters

Posted by Dawid Weiss <da...@gmail.com>.
I haven't seen this paper, but I believe methods other than SVD may
yield similar results. This in fact was one of the inspirations for my
own PhD thesis -- I used a simple numerical clustering technique
instead of SVD with a similar outcome. I remember papers like this
one:

Inderjit S. Dhillon and Dharmendra S. Modha. Concept Decompositions for
Large Sparse Text Data Using Clustering. Machine Learning, 42(1–2):143–175, 2001.

claiming that even simple clustering techniques allow you to
approximate matrix decompositions sensibly for the task of document
retrieval, for example. Interesting.
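
If I recall the construction, it is essentially a least-squares fit of the
documents onto the span of the normalized cluster centroids; a numpy sketch of
the gist (not the paper's notation, just the idea):

    import numpy as np

    def concept_decomposition(A, centroids):
        # A: docs x terms, centroids: k x terms (e.g. spherical k-means output)
        # fit A ~ Z @ C, so the centroids play the role that the top right
        # singular vectors play in a truncated SVD
        C = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        coeffs, *_ = np.linalg.lstsq(C.T, A.T, rcond=None)
        return coeffs.T, C      # coeffs.T: docs x k "concept" coordinates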

D.


On Mon, Jan 4, 2010 at 11:12 PM, Ted Dunning <te...@gmail.com> wrote:
> Btw... relative to the cost of decomposition, have you seen the recent spate
> of articles on stochastic decomposition?  It can dramatically speed up LSA.
>
> See http://arxiv.org/abs/0909.4061v1 for a good survey.  My guess is that
> you don't even need to do the SVD and could just use a random projection
> with a single power step (which is nearly equivalent to random indexing).
>
> On Mon, Jan 4, 2010 at 11:57 AM, Dawid Weiss <da...@gmail.com> wrote:
>
>> We agree, it was just me explaining things vaguely. The bottom line
>> is: a lot depends on what you're planning to do with the clusters and
>> the methodology should be suitable to this.
>>
>> Dawid
>>
>>
>> On Mon, Jan 4, 2010 at 8:53 PM, Ted Dunning <te...@gmail.com> wrote:
>> > I think I agree with this for clusters that are intended for human
>> > consumption, but I am sure that I disagree with this if you are looking
>> to
>> > use the clusters internally for machine learning purposes.
>> >
>> > The basic idea for the latter is that the distances to a bunch of
>> clusters
>> > can be used as a description of a point.  This description in terms of
>> > distances to cluster centroids can make some machine learning tasks
>> vastly
>> > easier.
>> >
>> > On Mon, Jan 4, 2010 at 11:44 AM, Dawid Weiss <da...@gmail.com>
>> wrote:
>> >
>> >> What's worse -- neither method is "better". We at Carrot2 have a
>> >> strong feeling that clusters should be described properly in order to
>> >> be useful, but one may argue that in many, many applications of
>> >> clustering, the labels are _not_ important and just individual
>> >> features of clusters (like keywords or even documents themselves) are
>> >> enough.
>> >>
>> >
>> >
>> >
>> > --
>> > Ted Dunning, CTO
>> > DeepDyve
>> >
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
Btw... relative to the cost of decomposition, have you seen the recent spate
of articles on stochastic decomposition?  It can dramatically speed up LSA.

See http://arxiv.org/abs/0909.4061v1 for a good survey.  My guess is that
you don't even need to do the SVD and could just use a random projection
with a single power step (which is nearly equivalent to random indexing).

On Mon, Jan 4, 2010 at 11:57 AM, Dawid Weiss <da...@gmail.com> wrote:

> We agree, it was just me explaining things vaguely. The bottom line
> is: a lot depends on what you're planning to do with the clusters and
> the methodology should be suitable to this.
>
> Dawid
>
>
> On Mon, Jan 4, 2010 at 8:53 PM, Ted Dunning <te...@gmail.com> wrote:
> > I think I agree with this for clusters that are intended for human
> > consumption, but I am sure that I disagree with this if you are looking
> to
> > use the clusters internally for machine learning purposes.
> >
> > The basic idea for the latter is that the distances to a bunch of
> clusters
> > can be used as a description of a point.  This description in terms of
> > distances to cluster centroids can make some machine learning tasks
> vastly
> > easier.
> >
> > On Mon, Jan 4, 2010 at 11:44 AM, Dawid Weiss <da...@gmail.com>
> wrote:
> >
> >> What's worse -- neither method is "better". We at Carrot2 have a
> >> strong feeling that clusters should be described properly in order to
> >> be useful, but one may argue that in many, many applications of
> >> clustering, the labels are _not_ important and just individual
> >> features of clusters (like keywords or even documents themselves) are
> >> enough.
> >>
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Dawid Weiss <da...@gmail.com>.
We agree, it was just me explaining things vaguely. The bottom line
is: a lot depends on what you're planning to do with the clusters and
the methodology should be suitable to this.

Dawid


On Mon, Jan 4, 2010 at 8:53 PM, Ted Dunning <te...@gmail.com> wrote:
> I think I agree with this for clusters that are intended for human
> consumption, but I am sure that I disagree with this if you are looking to
> use the clusters internally for machine learning purposes.
>
> The basic idea for the latter is that the distances to a bunch of clusters
> can be used as a description of a point.  This description in terms of
> distances to cluster centroids can make some machine learning tasks vastly
> easier.
>
> On Mon, Jan 4, 2010 at 11:44 AM, Dawid Weiss <da...@gmail.com> wrote:
>
>> What's worse -- neither method is "better". We at Carrot2 have a
>> strong feeling that clusters should be described properly in order to
>> be useful, but one may argue that in many, many applications of
>> clustering, the labels are _not_ important and just individual
>> features of clusters (like keywords or even documents themselves) are
>> enough.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
I think I agree with this for clusters that are intended for human
consumption, but I am sure that I disagree with this if you are looking to
use the clusters internally for machine learning purposes.

The basic idea for the latter is that the distances to a bunch of clusters
can be used as a description of a point.  This description in terms of
distances to cluster centroids can make some machine learning tasks vastly
easier.
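
In code, that description is nothing more than this (numpy sketch,
illustrative only):

    import numpy as np

    def cluster_distance_features(points, centroids):
        # points: n x d, centroids: k x d; the k distances to the centroids
        # become the new k-dimensional description of each point
        diff = points[:, None, :] - centroids[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))        # n x k

A downstream learner then trains on this n x k matrix instead of the raw
features.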

On Mon, Jan 4, 2010 at 11:44 AM, Dawid Weiss <da...@gmail.com> wrote:

> What's worse -- neither method is "better". We at Carrot2 have a
> strong feeling that clusters should be described properly in order to
> be useful, but one may argue that in many, many applications of
> clustering, the labels are _not_ important and just individual
> features of clusters (like keywords or even documents themselves) are
> enough.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Dawid Weiss <da...@gmail.com>.
> 1. Top terms based on weight (e.g. TFIDF) -- Implemented in Mahout in the ClusterDumper - Just sort the top terms across the docs in the cluster and spit out some subset
> 2. Log-likelihood Ratio (LLR) - Implemented in Mahout in ClusterLabels, currently requires Lucene index, but could be modified - Calculates the log-likelihood for the terms in the vectors and then sorts based on the value and spits out the labels
> 3. Some type of LSA/SVD approach - Implemented in Carrot2, others - Identify the concepts by taking the SVD of the vectors and then determine/use the base concepts derived from that
> 4. Frequent Phrases using something like a suffix tree or other phrase detection methods - Implemented in Carrot2 (Suffix Tree Clustering) and others - finds frequent phrases and sorts them based on a weight to return

Just a short follow-up note to this discussion.

The major difference is perhaps not in how you select labels, but in how
you approach clustering AND labelling together -- whether you want to
find labels for existing groups of items, or (which is what Carrot2
does) whether you want to find good potential cluster labels first and
then select (or reject) candidate labels according to whether they
actually correspond to groups of documents found in the input. The
actual algorithmic method follows one of these two paradigms.

I realize the difference may seem small at first, but it has huge
consequences for what you consider a cluster, for clustering quality, etc.
Consider these facts:

- Clusters for which there are no good labels can (and probably
should) be discarded. We simply cannot convey their meaning to the
user, so why bother.

- In the second approach the structure of clusters is strongly
affected by the labels and the mechanism of assigning documents to
these labels. If you keep it strict (e.g. "every item in the cluster
must contain the cluster's exact phrase"), then the relationship
between a cluster's label and its content will directly convey
something to the user. If you allow some fuzziness (e.g., "every item
in the cluster must contain at least one word from the set of keywords
forming the cluster's label"), then the label is only a hint and the
user must still come up with the "glue" (whatever it is that caused
these particular keywords to be grouped together).

What's worse -- neither method is "better". We at Carrot2 have a
strong feeling that clusters should be described properly in order to
be useful, but one may argue that in many, many applications of
clustering, the labels are _not_ important and just individual
features of clusters (like keywords or even documents themselves) are
enough.

Back to the labelling methods Grant described -- STC and Lingo (from
Carrot2) are representatives of the "labels come first" approach and
their usefulness will be limited in labelling existing document
clusters. STC, for example, is basically a complex feature extraction
algorithm (extraction of frequent phrases using suffix trees),
followed by a simple single-link greedy heuristic merging the document
sets implied by these features. Whether this approach can be scaled up
to labelling existing clusters -- I don't know.

What is certainly worth investigating is the extraction of better
features. n-grams are a good candidate (especially with pruning
techniques). Suffix arrays could be used to speed up the process to
some extent (batching a good sample of documents in a single mapper,
running frequent phrase detection, emitting only frequent phrases that
exceed a given threshold in the batch). If you'd like to experiment
with this approach, suffix tree code is in Carrot2 and suffix arrays
are here: http://www.jsuffixarrays.org
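
A naive sketch of that batched step, with plain dictionary counting standing
in for the suffix-array machinery (as noted, the discovery method does not
change what comes out, only how fast it comes out):

    from collections import Counter

    def frequent_phrases(docs, sizes=(2, 3), min_count=5):
        # docs: a batch of token lists, e.g. the sample seen by one mapper
        counts = Counter()
        for tokens in docs:
            for n in sizes:
                for i in range(len(tokens) - n + 1):
                    counts[" ".join(tokens[i:i + n])] += 1
        # emit only the phrases that clear the threshold within this batch
        return {p: c for p, c in counts.items() if c >= min_count}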

I don't know if it helps in any way. Apologies if I've been
inconsistent in any of the above -- had about a dozen urgent (and I
mean: urgent) interrupts from my two-year-old.

Dawid

Re: Methods for Naming Clusters

Posted by Dawid Weiss <da...@gmail.com>.
> I don't know what Carrot2 is doing, but this often involves scanning phrases

We don't do much, really. If somebody is interested, let me know, I'll
send you a pdf of this:

Stanisław Osiński, Dawid Weiss: A Concept-Driven Algorithm for
Clustering Search Results. IEEE Intelligent Systems, May/June, 3 (vol.
20), 2005, pp. 48—54.

> from the documents to find the ones most saliently related to the cluster
> centroid.  It can have very nice results, especially if you add some sense
> of how unusual the phrase is in the corpus.

Yes, this is one possibility. You need to be careful to avoid
mismatches -- keywords that are part of the phrase that would be
misleading as to the actual content of the cluster. My suggestion to
consider _phrases_ as features instead of single words still holds.
The sparsity of the input representation grows a lot, but the results
are much easier to interpret.

> This is also related to 1 and 2 except that it uses phrases and basically
> blows off the IDF part of the weighting (which is plausible since almost all
> phrases are pretty rare).  It is subject to problems where you get fixed
> phrases that proliferate through the corpus.  My own nasty experience was
> with the phrase "Staff writer of the Wall Street Journal" which seemed
> highly significant for several articles (for a moment).

In the original paper, Zamir and Etzioni suggest filtering out phrases
that are too long, start or end with common words, etc. (there are
more heuristics). This works remarkably well, although your comment is
valid -- getting rid of such phrases is hard (but on the other hand --
so it is with junk keywords).
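
Those heuristics boil down to something like this (sketch only; the stop list
and the length cap are placeholders, not the values from the paper):

    STOP = {"the", "a", "an", "of", "to", "and", "in", "for"}   # placeholder

    def keep_phrase(tokens, max_len=4):
        # drop candidate phrases that are too long or that start or end
        # with common words
        if len(tokens) > max_len:
            return False
        return tokens[0] not in STOP and tokens[-1] not in STOP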

> In my recent experience, suffix trees are often over-kill since you can
> count all the phrases in a small document set (a cluster) very, very
> quickly.  You don't even usually need to look at all of the members of a
> cluster, just the few dozen closest to the centroid.

Suffix trees may be overkill, but the task is done quickly and
efficiently using suffix arrays, so it may be worth the effort. The
discovery method does not change the final result, so naive methods
are usually best to start with, to see if you can gain anything.

Dawid

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Jan 3, 2010 at 1:47 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Just to add a little bit based on some research I've been doing on the
> subject, it seems there are several techniques for naming clusters, ranging
> from the mundane to the intricate:
>
> 1. Top terms based on weight (e.g. TFIDF) -- Implemented in Mahout in the
> ClusterDumper - Just sort the top terms across the docs in the cluster and
> spit out some subset
> 2. Log-likelihood Ratio (LLR) - Implemented in Mahout in ClusterLabels,
> currently requires Lucene index, but could be modified - Calculates the
> log-likelihood for the terms in the vectors and then sorts based on the
> value and spits out the labels
>

These two are actually closely related when you look at the math involved.
LLR has a few extra terms that help avoid some bad terms.

The LLR method can be extended to use a corpus bigram language model versus
a cluster bigram language model.  This requires counting bigrams over the
entire corpus as well as for the clusters.  If that is feasible, it can give
good results.
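
For reference, the score being discussed is the usual G^2 statistic on a 2x2
count table; here is a Python sketch of it (the same family of score as the
Mahout code, though not the actual ClusterLabels implementation). The bigram
extension only changes what gets counted into the four cells:

    from math import log

    def xlogx(x):
        return x * log(x) if x > 0 else 0.0

    def entropy(*counts):                 # unnormalized: N * H
        return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

    def llr(k11, k12, k21, k22):
        # k11: occurrences of the term (or bigram) inside the cluster
        # k12: occurrences of all other terms inside the cluster
        # k21: occurrences of the term in the rest of the corpus
        # k22: occurrences of all other terms in the rest of the corpus
        row = entropy(k11 + k12, k21 + k22)
        col = entropy(k11 + k21, k12 + k22)
        return 2.0 * (row + col - entropy(k11, k12, k21, k22))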


> 3. Some type of LSA/SVD approach - Implemented in Carrot2, others -
> Identify the concepts by taking the SVD of the vectors and then
> determine/use the base concepts derived from that
>

I don't know what Carrot2 is doing, but this often involves scanning phrases
from the documents to find the ones most saliently related to the cluster
centroid.  It can have very nice results, especially if you add some sense
of how unusual the phrase is in the corpus.


> 4. Frequent Phrases using something like a suffix tree or other phrase
> detection methods - Implemented in Carrot2 (Suffix Tree Clustering) and
> others - finds frequent phrases and sorts them based on a weight to return
>

This is also related to 1 and 2 except that it uses phrases and basically
blows off the IDF part of the weighting (which is plausible since almost all
phrases are pretty rare).  It is subject to problems where you get fixed
phrases that proliferate through the corpus.  My own nasty experience was
with the phrase "Staff writer of the Wall Street Journal" which seemed
highly significant for several articles (for a moment).

In my recent experience, suffix trees are often over-kill since you can
count all the phrases in a small document set (a cluster) very, very
quickly.  You don't even usually need to look at all of the members of a
cluster, just the few dozen closest to the centroid.
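
Concretely, the whole thing fits in a few lines (numpy sketch, illustrative
only):

    import numpy as np
    from collections import Counter

    def label_phrases(doc_vectors, docs_tokens, centroid, top_docs=30, n=2):
        # cosine-rank the cluster members, then count phrases only in the
        # few dozen documents closest to the centroid
        norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(centroid)
        sims = (doc_vectors @ centroid) / (norms + 1e-12)
        nearest = np.argsort(-sims)[:top_docs]
        counts = Counter()
        for i in nearest:
            toks = docs_tokens[i]
            counts.update(" ".join(toks[j:j + n])
                          for j in range(len(toks) - n + 1))
        return counts.most_common(20)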


> I'm probably missing some other approaches so feel free to fill in, but
> those are what I've come across so far.
>
> -Grant
>
> On Jan 3, 2010, at 3:07 PM, Ted Dunning wrote:
>
> > Good thing to do.
> >
> > Slightly tricky to do.  But worthy.
> >
> > On Sun, Jan 3, 2010 at 11:04 AM, Grant Ingersoll <gsingers@apache.org
> >wrote:
> >
> >> My first thought is to just create an n-gram model of the same field I'm
> >> clustering on (as that will allow the existing code to work unmodified),
> but
> >> I wanted to hear what others think.  Is it worth the time?
> >>
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
>
>
>


-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Grant Ingersoll <gs...@apache.org>.
Just to add a little bit based on some research I've been doing on the subject, it seems there are several techniques for naming clusters, ranging from the mundane to the intricate:

1. Top terms based on weight (e.g. TFIDF) -- Implemented in Mahout in the ClusterDumper - Just sort the top terms across the docs in the cluster and spit out some subset
2. Log-likelihood Ratio (LLR) - Implemented in Mahout in ClusterLabels, currently requires Lucene index, but could be modified - Calculates the log-likelihood for the terms in the vectors and then sorts based on the value and spits out the labels
3. Some type of LSA/SVD approach - Implemented in Carrot2, others - Identify the concepts by taking the SVD of the vectors and then determine/use the base concepts derived from that
4. Frequent Phrases using something like a suffix tree or other phrase detection methods - Implemented in Carrot2 (Suffix Tree Clustering) and others - finds frequent phrases and sorts them based on a weight to return

I'm probably missing some other approaches so feel free to fill in, but those are what I've come across so far.
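
As a concrete illustration of approach 1 above, it is little more than a
weighted sum and a sort (Python sketch, illustrative; not the ClusterDumper
code itself):

    import numpy as np

    def top_terms(tfidf, member_ids, vocab, k=10):
        # tfidf: docs x terms weight matrix, member_ids: docs in one cluster,
        # vocab: column index -> term string
        weights = np.asarray(tfidf[member_ids].sum(axis=0)).ravel()
        return [vocab[i] for i in np.argsort(-weights)[:k]]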

-Grant

On Jan 3, 2010, at 3:07 PM, Ted Dunning wrote:

> Good thing to do.
> 
> Slightly tricky to do.  But worthy.
> 
> On Sun, Jan 3, 2010 at 11:04 AM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> My first thought is to just create an n-gram model of the same field I'm
>> clustering on (as that will allow the existing code to work unmodified), but
>> I wanted to hear what others think.  Is it worth the time?
>> 
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve



Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
Good thing to do.

Slightly tricky to do.  But worthy.

On Sun, Jan 3, 2010 at 11:04 AM, Grant Ingersoll <gs...@apache.org> wrote:

> My first thought is to just create an n-gram model of the same field I'm
> clustering on (as that will allow the existing code to work unmodified), but
> I wanted to hear what others think.  Is it worth the time?
>



-- 
Ted Dunning, CTO
DeepDyve