Posted to user@mahout.apache.org by Paul Ingles <pa...@oobaloo.co.uk> on 2009/08/05 21:38:56 UTC

Methods for Naming Clusters

Hi,

As I've mentioned in the past, I'm working on clustering documents  
(albeit relatively small ones). The cluster mechanism I've ended up  
with has produced some pretty good results (at least for what I need  
to be able to do). However, what I'd like to be able to do is find a  
way to automate the naming of these groups.

For example, if each document has a six- or seven-word title, I'd like to
produce names that are somewhat logically ordered (that is, they make
grammatical sense; this can probably be inferred from frequency within
the clusters, since most titles in a cluster should be well formed) and
that share terms across the majority of the titles.

So far, I'm using a kind of hacked-together longest common substring
(LCS) method (a rough sketch follows the list):

* Sort the titles within the cluster
* Compare every string against every other string, producing an LCS value
* Use the most common LCS
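
A minimal sketch of that heuristic in plain Java (the class and method
names are my own illustration, not code from this thread):

import java.util.*;

// Sketch of the pairwise-LCS naming heuristic described above.
public class LcsNamer {
    // Classic dynamic-programming longest common substring of two strings.
    static String lcs(String a, String b) {
        int[][] len = new int[a.length() + 1][b.length() + 1];
        int best = 0, endA = 0;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > best) { best = len[i][j]; endA = i; }
                }
        return a.substring(endA - best, endA);
    }

    // Name a cluster by the most common pairwise LCS among its titles.
    static String name(List<String> titles) {
        Collections.sort(titles);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < titles.size(); i++)
            for (int j = i + 1; j < titles.size(); j++) {
                String s = lcs(titles.get(i), titles.get(j)).trim();
                if (!s.isEmpty()) counts.merge(s, 1, Integer::sum);
            }
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey).orElse("");
    }
}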

As this is all relatively new ground for me, I was wondering whether  
there were any better methods I could be using?

Thanks,
Paul

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
The problem was not in your corpus, but in your test (to misquote Marc
Antony).

See my surprise and coincidence paper, which shows why chi-squared tests
are evil for this sort of application:
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.2186>
The short answer is that they over-estimate how excited you should be by
as much as 300 orders of magnitude.

On Wed, Aug 5, 2009 at 3:26 PM, Tanton Gibbs <ta...@gmail.com> wrote:

> The problem was that the collection
> was so large that ANY repeated connection looked statistically
> significant (I was using chi-squares).  I eventually had to apply a
> cutoff, but I wonder if there was a more elegant way to do it.  I
> realize this is not the same thing as the OP's question - hope you
> don't mind :)
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Tanton Gibbs <ta...@gmail.com>.
> A nice way to do that is the log-likelihood ratio test that I use for
> everything under the sun.  This would consider in-cluster and out-of-cluster
> as two classes and would consider the frequency of each possible term or
> phrase in these two classes.  This will give you words and phrases that are
> anomalously common in your cluster and relatively rare outside it.

This seems to rely on the data source being homogeneous in some
respect.  For instance, your Wall Street Journal source was fairly
consistent with respect to the author.  What about sources that lack
homogeneity?  For instance, I once tried to apply something like this
to the automatic determination of nicknames.  I had a large corpus of
names and their connections so I could tell that Bill and William
belonged to the same individual.  The problem was that the collection
was so large that ANY repeated connection looked statistically
significant (I was using chi-squares).  I eventually had to apply a
cutoff, but I wonder if there was a more elegant way to do it.  I
realize this is not the same thing as the OP's question - hope you
don't mind :)

Tanton

Re: Methods for Naming Clusters

Posted by Grant Ingersoll <gs...@apache.org>.
On Aug 12, 2009, at 9:51 AM, Shashikant Kore wrote:

> BTW, I will clean up the code and upload a patch soon.

+1

Re: Methods for Naming Clusters

Posted by Shashikant Kore <sh...@gmail.com>.
BTW, I will clean up the code and upload a patch soon.

--shashi

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
Thinking about this a bit more, it occurs to me that this paper only
analyzes the corpus for collocations that are detectable in all contexts.
One major strength of LDA is that it allows for polysemy.  As such, it
should be possible to find collocations that are limited to particular topic
contexts much more sensitively than is possible when ignoring context.

Topic tagging of words also has a strong potential for language modeling and
retrieval in general exactly because it allows for good sense
disambiguation.

On Sat, Sep 5, 2009 at 4:27 PM, Ted Dunning <te...@gmail.com> wrote:

>
>
> On Sat, Sep 5, 2009 at 1:58 PM, Sebastien Bratieres <sb...@cam.ac.uk>wrote:
>
>> I've come across this article (Lafferty & Blei 2009)
>> http://www.citeulike.org/user/maximzhao/article/5084329 which seems to
>> build
>> upon Ted's log likelihood ratio.
>
>
> Yes.  And they do a good job of it.  Their backoff model is essentially
> identical to one that I proposed in my dissertation, but the permutation
> test is a good addition for deciding which LRs are useful in finding new
> n-grams for the model.  The permutation test that they propose is also very
> similar to the one that I used for analyzing genetic sequence data with
> LRs, although in a very different context.
>

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Sep 5, 2009 at 1:58 PM, Sebastien Bratieres <sb...@cam.ac.uk> wrote:

> I've come across this article (Lafferty & Blei 2009)
> http://www.citeulike.org/user/maximzhao/article/5084329 which seems to
> build
> upon Ted's log likelihood ratio.


Yes.  And they do a good job of it.  Their backoff model is essentially
identical to one that I proposed in my dissertation, but the permutation
test is a good addition for deciding which LRs are useful in finding new
n-grams for the model.  The permutation test that they propose is also very
similar to the one that I used for analyzing genetic sequence data with
LRs, although in a very different context.


> Ted, I'd be interested in knowing your opinion on this article; most
> importantly, how easily it can be implemented and what improvement it
> brings
> over LLR.
>

That really depends a lot on many factors.  The primary factor is how much
data you have to work with.  That scale has two effects.  First, it makes
fancy techniques much harder to apply.  Second, it makes simpler
techniques work better and better.  LLRs are very hard to beat in terms of
complexity.

In general, I will pretty much always opt for LLR computed independently for
all collocations simultaneously.  This definitely has the defects described
in the paper, but you can do the computation for all n-gram extensions with
at worst a single pass over the data.  That means that you need at most a
few passes (typically 7 or fewer).  There are also clever data structures
that can be built pretty economically that let you do all computations with
a single pass.  These clever structures, however, are difficult to scale.
If your data is moderate in size (say, Wikipedia), passes over the data only
take a few minutes.  For large clusters and web-scale corpora, you can pass
over the data in a few hours.

So to start, I just use the simplest possible implementation and evaluate
the results with users.  If the results are good enough, then I am done.

The next step is generally to use LR mostly as described in this paper.  The
only change is that I would recommend using a threshold to incorporate large
numbers of n-grams at a time into the model.  This minimizes the number of
passes over the data, but you will definitely have a longer process.  This
will likely pull in higher quality phrases than the simpler model at some
computational cost.

Finally, I would consider more elaborate processes such as Lafferty and Blei
suggest.  The real problem here is that they don't specify quite what they
do.  They give a constraint (permute, preserving known n-gram dependencies),
but not an exact procedure.  As such, it will be hard to replicate the
process without asking a few questions.

In practice, though, I would spend much more effort on good topic modeling.
In my experience, decent phrase discovery is not discernibly worse than great
phrase discovery, at least viewed on a scale determined by the difference
between good topic discovery and bad topic discovery.  Since I am usually
doing this work at a startup, development time is massively limited and
best-bang is the only feasible scheduling algorithm.

The next step for Mahout to explore any of this is to build a good
co-occurrence counter.
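
As a starting point, a hedged sketch of what such a counter might look
like in plain Java (names are illustrative; this is not the proposed
Mahout implementation):

import java.util.*;

// Counts how often each pair of distinct terms co-occurs in a document.
public class CooccurrenceCounter {
    static Map<String, Integer> count(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs) {
            // De-duplicate and sort terms so each pair is counted once per doc.
            List<String> terms = new ArrayList<>(new TreeSet<>(doc));
            for (int i = 0; i < terms.size(); i++)
                for (int j = i + 1; j < terms.size(); j++)
                    counts.merge(terms.get(i) + "\t" + terms.get(j), 1, Integer::sum);
        }
        return counts;
    }
}

At scale this would want to be a map-reduce job, and the quadratic inner
loop is exactly why very frequent terms need pruning.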

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
Not really.

Chapter 7 of my dissertation compared LLR-based relevance feedback
(unprojected except by common membership in the relevant set) versus an
unprojected ML method (InQuery) versus a one-step learning method.  That is
the closest I know of, and it isn't what you are looking for.


On Mon, Jan 4, 2010 at 3:37 PM, Jake Mannix <ja...@gmail.com> wrote:

>
> Do you know of any nice relevance comparisons of un-projected vs. one-step
> learning vs. random projection vs. svd?  For text or recommenders or
> anything
> else?




-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Jake Mannix <ja...@gmail.com>.
Fair enough.

I bet that, like many other situations in this field, it's highly
data-dependent.  For N x M matrices with d being the mean number of
nonempty entries per row: if d is fairly small, and N and M grow to be
very, very large, the need for "densification" to take higher-order
correlations into account becomes more pressing.  For text, d may be
small, but due to Zipf, there's that meaty set of columns which has a
pretty nice density (not stop-word density, but way denser than the
(N/M)*d mean density), which might provide enough support to do natural
smoothing at one step of learning.

For web graphs and most social graphs, getting data beyond 2 degrees is
pretty critical for all of the lightly connected nodes (they don't see
much before 3 degrees).

Do you know of any nice relevance comparisons of un-projected vs. one-step
learning vs. random projection vs. svd?  For text or recommenders or
anything
else?

  -jake

On Mon, Jan 4, 2010 at 3:22 PM, Ted Dunning <te...@gmail.com> wrote:

> I agree with everything you say.
>
> Except,
>
> I have found that A'AR gives an awful lot of what you need and may even be
> better in some ways than a full SVD.
>
> The *assumption* that incorporating all n-th degree connections is better
> in
> terms of results is just that.  Whether it actually is better is a matter
> of
> conjecture and I certainly don't have really strong evidence either way.
> The random indexing community claims to have really good results.  The LSA
> community claims some good results, notably Netflix results.  My own
> experience in document classification is that the results with A'AR (what
> we
> used to call one-step learning) and SVD are really, really similar.
>
> On Mon, Jan 4, 2010 at 2:55 PM, Jake Mannix <ja...@gmail.com> wrote:
>
> > When you notice that for text, ngrams like "software engineer" are now
> > considerably closer to "c++ developer" than to other ngrams, this gives
> > you information.  You don't get that information from a random
> > projection.  You'll get some of that information from A'AR, because you
> > get second-order correlations, but then you're still losing all
> > correlations beyond second-order (and a true eigenvector is getting you
> > the full infinite series of correlations, properly weighted).
> >
> > I mean, I guess you can use SVD purely for dimensional reduction, but
> > like you say, reduction can be done in lots of other, more efficient
> > ways.  Doing it with a reduction which enhances co-occurrence
> > relationships and distorts the metric to produce better clusters than
> > when you started is something that SVD, NMF, and LDA were designed for.
> >
> > Maybe I'm missing your point?
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
I agree with everything you say.

Except,

I have found that A'AR gives an awful lot of what you need and may even be
better in some ways than a full SVD.

The *assumption* that incorporating all n-th degree connections is better in
terms of results is just that.  Whether it actually is better is a matter of
conjecture and I certainly don't have really strong evidence either way.
The random indexing community claims to have really good results.  The LSA
community claims some good results, notably Netflix results.  My own
experience in document classification is that the results with A'AR (what we
used to call one-step learning) and SVD are really, really similar.

On Mon, Jan 4, 2010 at 2:55 PM, Jake Mannix <ja...@gmail.com> wrote:

> When you notice that for text, ngrams like "software engineer" are now
> considerably closer to "c++ developer" than to other ngrams, this gives you
> information.  You don't get that information from a random projection.
> You'll get some of that information from A'AR, because you get second-order
> correlations, but then you're still losing all correlations beyond
> second-order (and a true eigenvector is getting you the full infinite
> series of correlations,
> properly weighted).
>
> I mean, I guess you can use SVD purely for dimensional reduction, but like
> you say, reduction can be done in lots of other, more efficient ways.
> Doing it with a reduction which enhances co-occurrence relationships and
> distorts the metric to produce better clusters than when you started is
> something that SVD, NMF, and LDA were designed for.
>
> Maybe I'm missing your point?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Jake Mannix <ja...@gmail.com>.
Hmm... the degree to which I've found SVD useful is primarily contained in
the amount to which the metric is *not* preserved, in my experience...
that's the whole point, or else you get very little out of it: you trade a
high-dimensional sparse computation for a low-dimensional dense one, and if
you exactly preserved the metric you basically get nothing.

When you notice that for text, ngrams like "software engineer" are now
considerably closer to "c++ developer" than to other ngrams, this gives you
information.  You don't get that information from a random projection.
You'll get some of that information from A'AR, because you get second-order
correlations, but then you're still losing all correlations beyond
second-order (and a true eigenvector is getting you the full infinite
series of correlations, properly weighted).

I mean, I guess you can use SVD purely for dimensional reduction, but like
you say, reduction can be done in lots of other, more efficient ways.
Doing it with a reduction which enhances co-occurrence relationships and
distorts the metric to produce better clusters than when you started is
something that SVD, NMF, and LDA were designed for.

Maybe I'm missing your point?

  -jake

On Mon, Jan 4, 2010 at 2:44 PM, Ted Dunning <te...@gmail.com> wrote:

> SVD is (approximately) metric-preserving while also dimensionality
> reducing.  If you use A'AR instead of the actual term eigenvectors you
> should get similar results.
>
> On Mon, Jan 4, 2010 at 2:21 PM, Jake Mannix <ja...@gmail.com> wrote:
>
> > Ted, how would just doing a random projection do the right thing?  It's a
> > basically metric-preserving technique, and one of the primary reasons to
> > *do* LSA is to use a *different* metric (one in which "similar" terms are
> > nearer to each other than would be otherwise imagined).
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
SVD is (approximately) metric-preserving while also dimensionality
reducing.  If you use A'AR instead of the actual term eigenvectors you
should get similar results.
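
A hedged toy illustration of the A'AR idea (my own code, nothing from
Mahout): multiply the document-term matrix into a random Gaussian matrix,
then project back through A'.

import java.util.Random;

// One-step learning sketch: compute A'(A R) for a random R instead of
// the true term eigenvectors from an SVD.
public class OneStepProjection {
    static double[][] multiply(double[][] x, double[][] y) {
        double[][] out = new double[x.length][y[0].length];
        for (int i = 0; i < x.length; i++)
            for (int p = 0; p < y.length; p++)
                for (int j = 0; j < y[0].length; j++)
                    out[i][j] += x[i][p] * y[p][j];
        return out;
    }

    static double[][] transpose(double[][] x) {
        double[][] out = new double[x[0].length][x.length];
        for (int i = 0; i < x.length; i++)
            for (int j = 0; j < x[0].length; j++)
                out[j][i] = x[i][j];
        return out;
    }

    // a is docs x terms; the result is a terms x k embedding A'(A R).
    static double[][] oneStep(double[][] a, int k, long seed) {
        Random rnd = new Random(seed);
        double[][] r = new double[a[0].length][k];
        for (double[] row : r)
            for (int j = 0; j < k; j++) row[j] = rnd.nextGaussian();
        return multiply(transpose(a), multiply(a, r));
    }
}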

On Mon, Jan 4, 2010 at 2:21 PM, Jake Mannix <ja...@gmail.com> wrote:

> Ted, how would just doing a random projection do the right thing?  It's a
> basically metric-preserving technique, and one of the primary reasons to
> *do* LSA is to use a *different* metric (one in which "similar" terms are
> nearer to each other than would be otherwise imagined).
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Jake Mannix <ja...@gmail.com>.
Ted, how would just doing a random projection do the right thing?  It's a
basically metric-preserving technique, and one of the primary reasons to
*do* LSA is to use a *different* metric (one in which "similar" terms are
nearer to each other than would be otherwise imagined).

I've always thought that the primary point of that survey article was to
demonstrate how to speed up LSA by turning the rank-K decomposition of a
sparse N x M matrix into the rank-K decomposition of a dense
(k+\delta) x (k+\delta) matrix (for \delta of order smaller than k), after
doing a single K x N matrix multiplication (well, a single one in the
parallel sense).

You still want to do the decomposition, because this provides the proper
weighting for your dimensions.
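
A hedged sketch of that range-finding step (my own toy code, following
the survey's recipe; the final small SVD is left to any dense routine):

import java.util.Random;

// Form Y = A*Omega for a random Gaussian Omega, then orthonormalize Y.
// B = Q'A is then a small dense matrix whose SVD approximates A's.
public class RandomizedRange {
    // Modified Gram-Schmidt: orthonormalize the columns of y in place.
    static void orthonormalize(double[][] y) {
        int n = y.length, k = y[0].length;
        for (int j = 0; j < k; j++) {
            double norm = 0;
            for (int i = 0; i < n; i++) norm += y[i][j] * y[i][j];
            norm = Math.sqrt(norm);
            for (int i = 0; i < n; i++) y[i][j] /= norm;
            for (int l = j + 1; l < k; l++) {
                double dot = 0;
                for (int i = 0; i < n; i++) dot += y[i][j] * y[i][l];
                for (int i = 0; i < n; i++) y[i][l] -= dot * y[i][j];
            }
        }
    }

    // Q: an orthonormal basis (k + delta columns) for the range of A.
    static double[][] rangeBasis(double[][] a, int k, int delta, long seed) {
        Random rnd = new Random(seed);
        int n = a.length, m = a[0].length, cols = k + delta;
        double[][] omega = new double[m][cols];
        for (double[] row : omega)
            for (int j = 0; j < cols; j++) row[j] = rnd.nextGaussian();
        double[][] q = new double[n][cols];
        for (int i = 0; i < n; i++)
            for (int p = 0; p < m; p++)
                for (int j = 0; j < cols; j++)
                    q[i][j] += a[i][p] * omega[p][j];
        orthonormalize(q);
        return q;
    }
}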

  -jake

On Mon, Jan 4, 2010 at 2:12 PM, Ted Dunning <te...@gmail.com> wrote:

> Btw... relative to the cost of decomposition, have you seen the recent
> spate
> of articles on stochastic decomposition?  It can dramatically speed up LSA.
>
> See http://arxiv.org/abs/0909.4061v1 for a good survey.  My guess is that
> you don't even need to do the SVD and could just use a random projection
> with a single power step (which is nearly equivalent to random indexing).
>
> On Mon, Jan 4, 2010 at 11:57 AM, Dawid Weiss <da...@gmail.com>
> wrote:
>
> > We agree, it was just me explaining things vaguely. The bottom line
> > is: a lot depends on what you're planning to do with the clusters and
> > the methodology should be suitable to this.
> >
> > Dawid
> >
> >
> > On Mon, Jan 4, 2010 at 8:53 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > > I think I agree with this for clusters that are intended for human
> > > consumption, but I am sure that I disagree with this if you are looking
> > to
> > > use the clusters internally for machine learning purposes.
> > >
> > > The basic idea for the latter is that the distances to a bunch of
> > clusters
> > > can be used as a description of a point.  This description in terms of
> > > distances to cluster centroids can make some machine learning tasks
> > vastly
> > > easier.
> > >
> > > On Mon, Jan 4, 2010 at 11:44 AM, Dawid Weiss <da...@gmail.com>
> > wrote:
> > >
> > >> What's worse -- neither method is "better". We at Carrot2 have a
> > >> strong feeling that clusters should be described properly in order to
> > >> be useful, but one may argue that in many, many applications of
> > >> clustering, the labels are _not_ important and just individual
> > >> features of clusters (like keywords or even documents themselves) are
> > >> enough.
> > >>
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> > >
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Methods for Naming Clusters

Posted by Dawid Weiss <da...@gmail.com>.
I haven't seen this paper, but I believe methods other than SVD may
yield similar results. This in fact was one of the inspirations for my
own PhD thesis -- I used a simple numerical clustering technique
instead of SVD with a similar outcome. I remember papers like this
one:

Inderjit S. Dhillon and Dharmendra S. Modha. Concept Decompositions for
Large Sparse Text Data Using Clustering. Machine Learning, 42(1–2):143–175,
2001.

claiming that even simple clustering techniques allow you to
approximate matrix decompositions sensibly for the task of document
retrieval, for example. Interesting.

D.


On Mon, Jan 4, 2010 at 11:12 PM, Ted Dunning <te...@gmail.com> wrote:
> Btw... relative to the cost of decomposition, have you seen the recent spate
> of articles on stochastic decomposition?  It can dramatically speed up LSA.
>
> See http://arxiv.org/abs/0909.4061v1 for a good survey.  My guess is that
> you don't even need to do the SVD and could just use a random projection
> with a single power step (which is nearly equivalent to random indexing).
>
> On Mon, Jan 4, 2010 at 11:57 AM, Dawid Weiss <da...@gmail.com> wrote:
>
>> We agree, it was just me explaining things vaguely. The bottom line
>> is: a lot depends on what you're planning to do with the clusters and
>> the methodology should be suitable to this.
>>
>> Dawid
>>
>>
>> On Mon, Jan 4, 2010 at 8:53 PM, Ted Dunning <te...@gmail.com> wrote:
>> > I think I agree with this for clusters that are intended for human
>> > consumption, but I am sure that I disagree with this if you are looking
>> to
>> > use the clusters internally for machine learning purposes.
>> >
>> > The basic idea for the latter is that the distances to a bunch of
>> clusters
>> > can be used as a description of a point.  This description in terms of
>> > distances to cluster centroids can make some machine learning tasks
>> vastly
>> > easier.
>> >
>> > On Mon, Jan 4, 2010 at 11:44 AM, Dawid Weiss <da...@gmail.com>
>> wrote:
>> >
>> >> What's worse -- neither method is "better". We at Carrot2 have a
>> >> strong feeling that clusters should be described properly in order to
>> >> be useful, but one may argue that in many, many applications of
>> >> clustering, the labels are _not_ important and just individual
>> >> features of clusters (like keywords or even documents themselves) are
>> >> enough.
>> >>
>> >
>> >
>> >
>> > --
>> > Ted Dunning, CTO
>> > DeepDyve
>> >
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
Btw... relative to the cost of decomposition, have you seen the recent spate
of articles on stochastic decomposition?  It can dramatically speed up LSA.

See http://arxiv.org/abs/0909.4061v1 for a good survey.  My guess is that
you don't even need to do the SVD and could just use a random projection
with a single power step (which is nearly equivalent to random indexing).

On Mon, Jan 4, 2010 at 11:57 AM, Dawid Weiss <da...@gmail.com> wrote:

> We agree, it was just me explaining things vaguely. The bottom line
> is: a lot depends on what you're planning to do with the clusters and
> the methodology should be suitable to this.
>
> Dawid
>
>
> On Mon, Jan 4, 2010 at 8:53 PM, Ted Dunning <te...@gmail.com> wrote:
> > I think I agree with this for clusters that are intended for human
> > consumption, but I am sure that I disagree with this if you are looking
> to
> > use the clusters internally for machine learning purposes.
> >
> > The basic idea for the latter is that the distances to a bunch of
> clusters
> > can be used as a description of a point.  This description in terms of
> > distances to cluster centroids can make some machine learning tasks
> vastly
> > easier.
> >
> > On Mon, Jan 4, 2010 at 11:44 AM, Dawid Weiss <da...@gmail.com>
> wrote:
> >
> >> What's worse -- neither method is "better". We at Carrot2 have a
> >> strong feeling that clusters should be described properly in order to
> >> be useful, but one may argue that in many, many applications of
> >> clustering, the labels are _not_ important and just individual
> >> features of clusters (like keywords or even documents themselves) are
> >> enough.
> >>
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Dawid Weiss <da...@gmail.com>.
We agree, it was just me explaining things vaguely. The bottom line
is: a lot depends on what you're planning to do with the clusters and
the methodology should be suitable to this.

Dawid


On Mon, Jan 4, 2010 at 8:53 PM, Ted Dunning <te...@gmail.com> wrote:
> I think I agree with this for clusters that are intended for human
> consumption, but I am sure that I disagree with this if you are looking to
> use the clusters internally for machine learning purposes.
>
> The basic idea for the latter is that the distances to a bunch of clusters
> can be used as a description of a point.  This description in terms of
> distances to cluster centroids can make some machine learning tasks vastly
> easier.
>
> On Mon, Jan 4, 2010 at 11:44 AM, Dawid Weiss <da...@gmail.com> wrote:
>
>> What's worse -- neither method is "better". We at Carrot2 have a
>> strong feeling that clusters should be described properly in order to
>> be useful, but one may argue that in many, many applications of
>> clustering, the labels are _not_ important and just individual
>> features of clusters (like keywords or even documents themselves) are
>> enough.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
I think I agree with this for clusters that are intended for human
consumption, but I am sure that I disagree with this if you are looking to
use the clusters internally for machine learning purposes.

The basic idea for the latter is that the distances to a bunch of clusters
can be used as a description of a point.  This description in terms of
distances to cluster centroids can make some machine learning tasks vastly
easier.
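
A hedged illustration of that idea (toy code, not Mahout's API): describe
each point by its distances to the cluster centroids, and feed that vector
to a downstream learner.

// One feature per centroid: the distance from the point to that centroid.
public class CentroidFeatures {
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    static double[] features(double[] point, double[][] centroids) {
        double[] out = new double[centroids.length];
        for (int i = 0; i < centroids.length; i++)
            out[i] = euclidean(point, centroids[i]);
        return out;
    }
}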

On Mon, Jan 4, 2010 at 11:44 AM, Dawid Weiss <da...@gmail.com> wrote:

> What's worse -- neither method is "better". We at Carrot2 have a
> strong feeling that clusters should be described properly in order to
> be useful, but one may argue that in many, many applications of
> clustering, the labels are _not_ important and just individual
> features of clusters (like keywords or even documents themselves) are
> enough.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Dawid Weiss <da...@gmail.com>.
> 1. Top terms based on weight (e.g. TFIDF) -- Implemented in Mahout in the ClusterDumper - Just sort the top terms across the docs in the cluster and spit out some subset
> 2. Log-likelihood Ratio (LLR) - Implemented in Mahout in ClusterLabels, currently requires Lucene index, but could be modified - Calculates the log-likelihood for the terms in the vectors and then sorts based on the value and spits out the labels
> 3. Some type of LSA/SVD approach - Implemented in Carrot2, others - Identify the concepts by taking the SVD of the vectors and then determine/use the base concepts derived from that
> 4. Frequent Phrases using something like a suffix tree or other phrase detection methods - Implemented in Carrot2 (Suffix Tree Clustering) and others - finds frequent phrases and sorts them based on a weight to return

Just a short follow-up note to this discussion.

The major difference is perhaps not in how you select labels, but in how
you approach clustering AND labelling together -- whether you want to
find labels for existing groups of items or (which is what Carrot2
does) whether you want to find good potential cluster labels first and
then select (or reject) those labels that actually correspond to the
groups of documents found in the input. The actual algorithmic method
follows either of these paradigms.

I realize the difference may seem small at first, but it has huge
consequences in what you consider a cluster, clustering quality, etc.
Consider these facts:

- Clusters for which there are no good labels can (and probably
should) be discarded. We simply cannot convey their meaning to the
user, so why bother.

- In the second approach the structure of clusters is strongly
affected by the labels and the mechanism of assigning documents to
these labels. If you keep it strict (e.g. "every item in the cluster
must contain the cluster's exact phrase"), then the relationship
between a cluster's label and its content will directly convey
something to the user. If you allow some fuzziness (e.g., "every item
in the cluster must contain at least one word from the set of keywords
forming the cluster's label"), then the label is only a hint and the
user must still come up with the "glue" (whatever it is that caused
these particular keywords to be grouped together).

What's worse -- neither method is "better". We at Carrot2 have a
strong feeling that clusters should be described properly in order to
be useful, but one may argue that in many, many applications of
clustering, the labels are _not_ important and just individual
features of clusters (like keywords or even documents themselves) are
enough.

Back to the labelling methods Grant described -- STC and Lingo (from
Carrot2) are representatives of the "labels come first" approach, and
their usefulness will be limited in labelling existing document
clusters. STC, for example, is basically a complex feature extraction
algorithm (extraction of frequent phrases using suffix trees),
followed by a simple single-link greedy heuristic merging the document
sets implied by these features. Whether this approach can be scaled up
to labelling existing clusters -- I don't know.

What is certainly worth investigation is extraction of better
features. n-grams are a good candidate (especially with pruning
techniques). Suffix arrays could be used to speed up the process to
some extent (batching a good sample of documents in a single mapper,
running frequent phrase detection, emitting only frequent phrases that
exceed a given threshold in the batch). If you'd like to experiment
with this approach, suffix tree code is in Carrot2 and suffix arrays
are here: http://www.jsuffixarrays.org

I don't know if it helps in any way. Apologies if I've been
inconsistent in any of the above -- had about a dozen urgent (and I
mean: urgent) interrupts from my two-year-old.

Dawid

Re: Methods for Naming Clusters

Posted by Dawid Weiss <da...@gmail.com>.
> I don't know what Carrot2 is doing, but this often involves scanning phrases

We don't do much, really. If somebody is interested, let me know, I'll
send you a pdf of this:

Stanisław Osiński, Dawid Weiss: A Concept-Driven Algorithm for
Clustering Search Results. IEEE Intelligent Systems, May/June, 3 (vol.
20), 2005, pp. 48—54.

> from the documents to find the ones most saliently related to the cluster
> centroid.  It can have very nice results, especially if you add some sense
> of how unusual the phrase is in the corpus.

Yes, this is one possibility. You need to be careful to avoid
mismatches -- keywords that are part of the phrase but would be
misleading as to the actual content of the cluster. My suggestion to
consider _phrases_ as features instead of single words still holds.
The sparsity of the input representation grows a lot, but the results
are much easier to interpret.

> This is also related to 1 and 2 except that it uses phrases and basically
> blows off the IDF part of the weighting (which is plausible since almost all
> phrases are pretty rare).  It is subject to problems where you get fixed
> phrases that proliferate through the corpus.  My own nasty experience was
> with the phrase "Staff writer of the Wall Street Journal" which seemed
> highly significant for several articles (for a moment).

In the original paper, Zamir and Etzioni suggest filtering out phrases
that are too long, start or end with common words, etc. (there are
more heuristics). This works remarkably well, although your comment is
valid -- getting rid of such phrases is hard (but on the other hand,
so it is with junk keywords).

> In my recent experience, suffix trees are often over-kill since you can
> count all the phrases in a small document set (a cluster) very, very
> quickly.  You don't even usually need to look at all of the members of a
> cluster, just the few dozen closest to the centroid.

Suffix trees may be overkill, but the task is quickly and efficiently
done using suffix arrays, and then it may be worth the effort. The
discovery method does not change the final result, so naive methods
are usually best to start with to see if you can gain anything.
Dawid

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Jan 3, 2010 at 1:47 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Just to add a little bit based on some research I've been doing on the
> subject, it seems there are several techniques for naming clusters, ranging
> from the mundane to the intricate:
>
> 1. Top terms based on weight (e.g. TFIDF) -- Implemented in Mahout in the
> ClusterDumper - Just sort the top terms across the docs in the cluster and
> spit out some subset
> 2. Log-likelihood Ratio (LLR) - Implemented in Mahout in ClusterLabels,
> currently requires Lucene index, but could be modified - Calculates the
> log-likelihood for the terms in the vectors and then sorts based on the
> value and spits out the labels
>

These two are actually closely related when you look at the math involved.
LLR has a few extra terms that help avoid some bad terms.

The LLR method can be extended to use a corpus bigram language model versus
a cluster bigram language model.  This requires counting bigrams over the
entire corpus as well as for the clusters.  If that is feasible, it can give
good results.
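
A hedged sketch of the bookkeeping that extension needs (names are mine):
build the 2x2 count table for each bigram, cluster versus rest of corpus,
and score it with LLR.

// Form the 2x2 contingency table for one bigram, as input to an LLR score
// (an entropy-based llr() appears later on this page).  Illustrative only.
public class BigramTable {
    static double[][] table(long bigramInCluster, long clusterBigramTotal,
                            long bigramInCorpus, long corpusBigramTotal) {
        long k11 = bigramInCluster;                       // this bigram, in cluster
        long k12 = clusterBigramTotal - bigramInCluster;  // other bigrams, in cluster
        long k21 = bigramInCorpus - bigramInCluster;      // this bigram, out of cluster
        long k22 = corpusBigramTotal - clusterBigramTotal - k21; // everything else
        return new double[][] {{k11, k12}, {k21, k22}};
    }
}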


> 3. Some type of LSA/SVD approach - Implemented in Carrot2, others -
> Identify the concepts by taking the SVD of the vectors and then
> determine/use the base concepts derived from that
>

I don't know what Carrot2 is doing, but this often involves scanning phrases
from the documents to find the ones most saliently related to the cluster
centroid.  It can have very nice results, especially if you add some sense
of how unusual the phrase is in the corpus.


> 4. Frequent Phrases using something like a suffix tree or other phrase
> detection methods - Implemented in Carrot2 (Suffix Tree Clustering) and
> others - finds frequent phrases and sorts them based on a weight to return
>

This is also related to 1 and 2 except that it uses phrases and basically
blows off the IDF part of the weighting (which is plausible since almost all
phrases are pretty rare).  It is subject to problems where you get fixed
phrases that proliferate through the corpus.  My own nasty experience was
with the phrase "Staff writer of the Wall Street Journal" which seemed
highly significant for several articles (for a moment).

In my recent experience, suffix trees are often over-kill since you can
count all the phrases in a small document set (a cluster) very, very
quickly.  You don't even usually need to look at all of the members of a
cluster, just the few dozen closest to the centroid.


> I'm probably missing some other approaches so feel free to fill in, but
> those are what I've come across so far.
>
> -Grant
>
> On Jan 3, 2010, at 3:07 PM, Ted Dunning wrote:
>
> > Good thing to do.
> >
> > Slightly tricky to do.  But worthy.
> >
> > On Sun, Jan 3, 2010 at 11:04 AM, Grant Ingersoll <gsingers@apache.org
> >wrote:
> >
> >> My first thought is to just create an n-gram model of the same field I'm
> >> clustering on (as that will allow the existing code to work unmodified),
> but
> >> I wanted to hear what others think.  Is it worth the time?
> >>
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
>
>
>


-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Grant Ingersoll <gs...@apache.org>.
Just to add a little bit based on some research I've been doing on the subject, it seems there are several techniques for naming clusters, ranging from the mundane to the intricate:

1. Top terms based on weight (e.g. TFIDF) -- Implemented in Mahout in the ClusterDumper - Just sort the top terms across the docs in the cluster and spit out some subset
2. Log-likelihood Ratio (LLR) - Implemented in Mahout in ClusterLabels, currently requires Lucene index, but could be modified - Calculates the log-likelihood for the terms in the vectors and then sorts based on the value and spits out the labels
3. Some type of LSA/SVD approach - Implemented in Carrot2, others - Identify the concepts by taking the SVD of the vectors and then determine/use the base concepts derived from that
4. Frequent Phrases using something like a suffix tree or other phrase detection methods - Implemented in Carrot2 (Suffix Tree Clustering) and others - finds frequent phrases and sorts them based on a weight to return

I'm probably missing some other approaches so feel free to fill in, but those are what I've come across so far.
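
For reference, a hedged sketch of technique 1 (illustrative only, not the
actual ClusterDumper code): sum the TF-IDF vectors in the cluster and emit
the heaviest terms.

import java.util.*;
import java.util.stream.Collectors;

// Label a cluster with its top-weighted terms.
public class TopTermLabels {
    // docs: one term -> tf-idf weight map per document in the cluster.
    static List<String> topTerms(List<Map<String, Double>> docs, int n) {
        Map<String, Double> sums = new HashMap<>();
        for (Map<String, Double> doc : docs)
            doc.forEach((term, w) -> sums.merge(term, w, Double::sum));
        return sums.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}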

-Grant

On Jan 3, 2010, at 3:07 PM, Ted Dunning wrote:

> Good thing to do.
> 
> Slightly tricky to do.  But worthy.
> 
> On Sun, Jan 3, 2010 at 11:04 AM, Grant Ingersoll <gs...@apache.org>wrote:
> 
>> My first thought is to just create an n-gram model of the same field I'm
>> clustering on (as that will allow the existing code to work unmodified), but
>> I wanted to hear what others think.  Is it worth the time?
>> 
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve



Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
Good thing to do.

Slightly tricky to do.  But worthy.

On Sun, Jan 3, 2010 at 11:04 AM, Grant Ingersoll <gs...@apache.org> wrote:

> My first thought is to just create an n-gram model of the same field I'm
> clustering on (as that will allow the existing code to work unmodified), but
> I wanted to hear what others think.  Is it worth the time?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Grant Ingersoll <gs...@apache.org>.
Resuscitating this again...

So, I committed MAHOUT-163 (thanks, Shashi!) which implements Ted's log likelihood ideas and I've been trying it out and also comparing it to what Carrot2 does for generating labels.  One of the things that I think would make sense is to extend MAHOUT-163 to have the option to return phrases instead of just terms.  My first thought is to just create an n-gram model of the same field I'm clustering on (as that will allow the existing code to work unmodified), but I wanted to hear what others think.  Is it worth the time?

I'm also interested in other approaches people have taken.

-Grant

On Sep 5, 2009, at 4:58 PM, Sebastien Bratieres wrote:

> Hi,
> 
> (I know this is an old topic -- but I am resuscitating it on purpose!)
> 
> I've come across this article (Lafferty & Blei 2009)
> http://www.citeulike.org/user/maximzhao/article/5084329 which seems to build
> upon Ted's log likelihood ratio. The goal is exactly the original poster's
> question: how to characterize a topic cluster with its terms.
> Ted, I'd be interested in knowing your opinion on this article; most
> importantly, how easily it can be implemented and what improvement it brings
> over LLR.
> 
> I hope this can help people on the list who are busy with topic clustering!
> 
> Sebastien
> 
> 
> 2009/8/12 Shashikant Kore <sh...@gmail.com>
> 
>> I was referring to the condition where a phrase is identified as good
>> by LLR and is also a prominent feature of the centroid.  But, as you
>> clarified, the LLR score alone is a good indicator for top labels.
>> 
>> Thanks for the pointer for co-occurrence statistics. I will study some
>> literature on that.
>> 
>> --shashi
>> 
>> On Wed, Aug 12, 2009 at 11:23 PM, Ted Dunning<te...@gmail.com>
>> wrote:
>>> On Wed, Aug 12, 2009 at 6:12 AM, Shashikant Kore <shashikant@gmail.com
>>> wrote:
>>> 
>>>> 
>>>> Is this a necessary & sufficient  condition for a good cluster label?
>>> 
>>> 
>>> I am not entirely clear what "this" is.  My assertion is that high LLR
>> score
>>> is sufficient evidence to use the term or phrase.  I generally also limit
>>> the number of terms as well, taking only the highest scoring ones.  The
>>> necessary and sufficient phrase comes from a rigorous mathematical
>>> background that doesn't entirely apply here where we are talking about
>>> heuristics like this.
>>> 
>>> 
>>>> On a different note, is there any way to identify relationships among
>>>> the top labels of the clusters? For example, if I have a cluster related
>>>> to automobiles, I may get the companies (GM, Ford, Toyota) along with
>>>> their popular models (Corolla, Cadillac, ...) as top labels. How can I
>>>> figure out that Toyota and Corolla are strongly related?
>>> 
>>> 
>>> Look at the co-occurrence statistics of the terms themselves.  Use that
>> to
>>> form a sparse graph.  Then do spectral clustering or agglomerative
>>> clustering on the graph.
>>> 
>>> That will give you clusters of terms that will give you much of what you
>>> seek.  Of course, the fact that the terms are being used to describe the
>>> same cluster means that you have a good chance of just replicating the
>> label
>>> sets for your clusters.
>>> 
>>> --
>>> Ted Dunning, CTO
>>> DeepDyve
>>> 
>> 



Re: Methods for Naming Clusters

Posted by Sebastien Bratieres <sb...@cam.ac.uk>.
Hi,

(I know this is an old topic -- but I am resuscitating it on purpose!)

I've come across this article (Lafferty & Blei 2009)
http://www.citeulike.org/user/maximzhao/article/5084329 which seems to build
upon Ted's log likelihood ratio. The goal is exactly the original poster's
question: how to characterize a topic cluster with its terms.
Ted, I'd be interested in knowing your opinion on this article; most
importantly, how easily it can be implemented and what improvement it brings
over LLR.

I hope this can help people on the list who are busy with topic clustering!

Sebastien


2009/8/12 Shashikant Kore <sh...@gmail.com>

> I was referring to the condition where a phrase is identified as good
> by LLR and is also a prominent feature of the centroid.  But, as you
> clarified, the LLR score alone is a good indicator for top labels.
>
> Thanks for the pointer for co-occurrence statistics. I will study some
> literature on that.
>
> --shashi
>
> On Wed, Aug 12, 2009 at 11:23 PM, Ted Dunning<te...@gmail.com>
> wrote:
> > On Wed, Aug 12, 2009 at 6:12 AM, Shashikant Kore <shashikant@gmail.com
> >wrote:
> >
> >>
> >> Is this a necessary & sufficient  condition for a good cluster label?
> >
> >
> > I am not entirely clear what "this" is.  My assertion is that high LLR
> score
> > is sufficient evidence to use the term or phrase.  I generally also limit
> > the number of terms as well, taking only the highest scoring ones.  The
> > necessary and sufficient phrase comes from a rigorous mathematical
> > background that doesn't entirely apply here where we are talking about
> > heuristics like this.
> >
> >
> >> On a different note, is there any way to identify relationships among
> >> the top labels of the clusters? For example, if I have a cluster related
> >> to automobiles, I may get the companies (GM, Ford, Toyota) along with
> >> their popular models (Corolla, Cadillac, ...) as top labels. How can I
> >> figure out that Toyota and Corolla are strongly related?
> >
> >
> > Look at the co-occurrence statistics of the terms themselves.  Use that
> to
> > form a sparse graph.  Then do spectral clustering or agglomerative
> > clustering on the graph.
> >
> > That will give you clusters of terms that will give you much of what you
> > seek.  Of course, the fact that the terms are being used to describe the
> > same cluster means that you have a good chance of just replicating the
> label
> > sets for your clusters.
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>

Re: Methods for Naming Clusters

Posted by Shashikant Kore <sh...@gmail.com>.
I was referring to the condition where a phrase is identified as good
by LLR and is also a prominent feature of the centroid.  But, as you
clarified, the LLR score alone is a good indicator for top labels.

Thanks for the pointer for co-occurrence statistics. I will study some
literature on that.

--shashi

On Wed, Aug 12, 2009 at 11:23 PM, Ted Dunning <te...@gmail.com> wrote:
> On Wed, Aug 12, 2009 at 6:12 AM, Shashikant Kore <sh...@gmail.com>wrote:
>
>>
>> Is this a necessary & sufficient  condition for a good cluster label?
>
>
> I am not entirely clear what "this" is.  My assertion is that high LLR score
> is sufficient evidence to use the term or phrase.  I generally also limit
> the number of terms as well, taking only the highest scoring ones.  The
> necessary and sufficient phrase comes from a rigorous mathematical
> background that doesn't entirely apply here where we are talking about
> heuristics like this.
>
>
>> On a different note, is there any way to identify relationships among
>> the top labels of the clusters? For example, if I have a cluster related
>> to automobiles, I may get the companies (GM, Ford, Toyota) along with
>> their popular models (Corolla, Cadillac, ...) as top labels. How can I
>> figure out that Toyota and Corolla are strongly related?
>
>
> Look at the co-occurrence statistics of the terms themselves.  Use that to
> form a sparse graph.  Then do spectral clustering or agglomerative
> clustering on the graph.
>
> That will give you clusters of terms that will give you much of what you
> seek.  Of course, the fact that the terms are being used to describe the
> same cluster means that you have a good chance of just replicating the label
> sets for your clusters.
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Aug 12, 2009 at 6:12 AM, Shashikant Kore <sh...@gmail.com> wrote:

>
> Is this a necessary & sufficient  condition for a good cluster label?


I am not entirely clear what "this" is.  My assertion is that high LLR score
is sufficient evidence to use the term or phrase.  I generally also limit
the number of terms as well, taking only the highest scoring ones.  The
necessary and sufficient phrase comes from a rigorous mathematical
background that doesn't entirely apply here where we are talking about
heuristics like this.


> On a different note, is there any way to identify relationships among
> the top labels of the clusters? For example, if I have a cluster related
> to automobiles, I may get the companies (GM, Ford, Toyota) along with
> their popular models (Corolla, Cadillac, ...) as top labels. How can I
> figure out that Toyota and Corolla are strongly related?


Look at the co-occurrence statistics of the terms themselves.  Use that to
form a sparse graph.  Then do spectral clustering or agglomerative
clustering on the graph.

That will give you clusters of terms that will give you much of what you
seek.  Of course, the fact that the terms are being used to describe the
same cluster means that you have a good chance of just replicating the label
sets for your clusters.
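
A hedged toy version of that pipeline (my own illustration; connected
components stand in for the spectral or agglomerative step, and pair
counts are assumed keyed as "termA\ttermB"):

import java.util.*;

// Threshold pairwise co-occurrence counts into a sparse graph, then take
// connected components as a crude clustering of the terms.
public class TermGraph {
    static List<Set<String>> cluster(Map<String, Integer> pairCounts, int minCount) {
        Map<String, String> parent = new HashMap<>();
        for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
            if (e.getValue() < minCount) continue;  // keeps the graph sparse
            String[] terms = e.getKey().split("\t");
            union(parent, terms[0], terms[1]);
        }
        Map<String, Set<String>> groups = new HashMap<>();
        for (String term : new ArrayList<>(parent.keySet()))
            groups.computeIfAbsent(find(parent, term), r -> new TreeSet<>()).add(term);
        return new ArrayList<>(groups.values());
    }

    static String find(Map<String, String> parent, String x) {
        parent.putIfAbsent(x, x);
        while (!parent.get(x).equals(x)) x = parent.get(x);
        return x;
    }

    static void union(Map<String, String> parent, String a, String b) {
        parent.put(find(parent, a), find(parent, b));
    }
}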

-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Shashikant Kore <sh...@gmail.com>.
On Wed, Aug 12, 2009 at 10:50 AM, Ted Dunning <te...@gmail.com> wrote:
> Whoa....
>
> No.  It sounds like I have muddied things thoroughly.  What I was saying is
> that there are times that tf.idf and llr agree and times that tf.idf and llr
> disagree.  In my experience, most of the second category are where tf.idf is
> over-weighting coincidental cases or where both scores are producing not
> good stuff.
>
> If a phrase or term is marked as good by LLR and is a prominent feature of
> the centroid, that is fine.
>

Thanks for the explanation, Ted.

Is this a necessary & sufficient  condition for a good cluster label?

On a different note, is there any way to identify relationships among
the top labels of the clusters? For example, if I have a cluster related
to automobiles, I may get the companies (GM, Ford, Toyota) along with
their popular models (Corolla, Cadillac, ...) as top labels. How can I
figure out that Toyota and Corolla are strongly related?

--shashi

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
Whoa....

No.  It sounds like I have muddied things thoroughly.  What I was saying is
that there are times that tf.idf and llr agree and times that tf.idf and llr
disagree.  In my experience, most of the second category are where tf.idf is
over-weighting coincidental cases or where both scores are producing not
good stuff.

If a phrase or term is marked as good by LLR and is a prominent feature of
the centroid, that is fine.

On Tue, Aug 11, 2009 at 10:54 AM, Shashikant Kore <sh...@gmail.com> wrote:

> On Tue, Aug 11, 2009 at 8:57 PM, Ted Dunning<te...@gmail.com> wrote:
> > If you expand the LLR equation and look at which terms are big, you will
> see
> > k_11 * log(mumble)  as an important term for many words.  Usually, this
> is
> > about the same as tf.idf since mumble is about the same as the term
> > frequency.  For a single document, tf.idf is a very close approximation
> of
> > LLR.  With many documents, the situation can change dramatically,
> however,
> > and you can get cancellation effects that eliminate documents that would
> > otherwise have high tf.idf.  These are generally the terms that lead to
> > over-fitting with methods like naive bayes and are often not such great
> > cluster descriptors.
> >
>
> Let me restate what I understood.
>
> If a phrase is identified as a prominent phrase by LLR and it also
> happens to be the top-weighted feature in the centroid vector, it is
> not a good candidate for a cluster label.
>
> Is this correct?




-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Shashikant Kore <sh...@gmail.com>.
On Tue, Aug 11, 2009 at 8:57 PM, Ted Dunning <te...@gmail.com> wrote:
> If you expand the LLR equation and look at which terms are big, you will see
> k_11 * log(mumble)  as an important term for many words.  Usually, this is
> about the same as tf.idf since mumble is about the same as the term
> frequency.  For a single document, tf.idf is a very close approximation of
> LLR.  With many documents, the situation can change dramatically, however,
> and you can get cancellation effects that eliminate documents that would
> otherwise have high tf.idf.  These are generally the terms that lead to
> over-fitting with methods like naive bayes and are often not such great
> cluster descriptors.
>

Let me restate what I understood.

If a phrase is identified as a prominent phrase by LLR and it also
happens to be the top-weighted feature in the centroid vector, it is
not a good candidate for a cluster label.

Is this correct?

--shashi

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
If you expand the LLR equation and look at which terms are big, you will see
k_11 * log(mumble)  as an important term for many words.  Usually, this is
about the same as tf.idf since mumble is about the same as the term
frequency.  For a single document, tf.idf is a very close approximation of
LLR.  With many documents, the situation can change dramatically, however,
and you can get cancellation effects that eliminate documents that would
otherwise have high tf.idf.  These are generally the terms that lead to
over-fitting with methods like naive bayes and are often not such great
cluster descriptors.

On Tue, Aug 11, 2009 at 4:29 AM, Shashikant Kore <sh...@gmail.com> wrote:

> A simple interpretation of this can be given by the fact that when a
> phrase is quite common in a small cluster but uncommon out of cluster,
> it is going to have a higher (TF-IDF) weight in the document vector.
> Such a phrase is identified as prominent by LLR as well.
>
> Is there any other reason for this to occur?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Shashikant Kore <sh...@gmail.com>.
Hi Ted,

An observation from the results.

Previously, I just took the top 10 features of the centroid vector as
labels.  The order was decided by the weight of the feature in the
centroid vector.

Now, when I see the top phrases from the LLR method, I see there is an
overlap between these two result sets.  The results are not exactly the
same, but most of the top features of the vector score high with the LLR
technique as well.  In a few cases, I can see the LLR technique comes up
with phrases which are very different.

A simple interpretation of this can be given by the fact that when a
phrase is quite common in a small cluster but uncommon out of cluster,
it is going to have a higher (TF-IDF) weight in the document vector.
Such a phrase is identified as prominent by LLR as well.

Is there any other reason for this to occur?

Thanks,

--shashi

Re: Methods for Naming Clusters

Posted by Shashikant Kore <sh...@gmail.com>.
On Tue, Aug 11, 2009 at 12:11 AM, Ted Dunning <te...@gmail.com> wrote:
> llr <- function(x) { 2*(entropy(rowSums(x)) + entropy(colSums(x)) -
> entropy(x))}
>
>  I moved around some minus signs to make me happy, btw.  All that
> matters in the end is that entropy and llr both give positive values.
>

I am not sure if I am doing it right.  I get all entropy values
positive.  But the LLR value is negative if calculated as above.  LLR
will be positive if the equation as specified on the blog is used,
i.e.

llr <- function(x) { 2*( entropy(x) - entropy(rowSums(x)) -
entropy(colSums(x)))}

--shashi

Re: Methods for Naming Clusters

Posted by Shashikant Kore <sh...@gmail.com>.
Great!

Solved the equations on paper and now I understand entropy and LLR.

Thanks, Ted.

--shashi

On Tue, Aug 11, 2009 at 12:11 AM, Ted Dunning <te...@gmail.com> wrote:
> I have some R code handy.  It should be easy to translate to Mahoutish Java.
>
> entropy <- function(x) {-sum(x*log((x+(x==0))/sum(x)))}
>
> The only tricky bit here is that sum will sum up a vector or a matrix and
> (x==0) returns a matrix or vector of booleans that is the same shape as x.
> Adding a boolean means that all instances where x!=0 are left alone while
> all zeros are incremented by one.  This causes the log to return zero, but
> we are (elementwise) multiplying by x anyway.  Elements of x cannot be < 0
> by definition (they are counts).  The reason for all of that is so that 0
> log 0 = 0 as desired.  If sum(x) = 1, then this function will give you
> Shannon entropy.
>
> and then:
>
> llr <- function(x) { 2*(entropy(rowSums(x)) + entropy(colSums(x)) -
> entropy(x))}
>
> The functions rowSums and colSums do pretty much what you would expect.
> Given a matrix, they add up the elements in each row or column (to get a
> vector).  I moved around some minus signs to make me happy, btw.  All that
> matters in the end is that entropy and llr both give positive values.
>
> Some test cases:
>
> > entropy(c(1,1))
> [1] 1.386294
> > llr(matrix(c(1,0,0,1), nrow=2))
> [1] 2.772589
> > llr(matrix(c(10,0,0,10), nrow=2))
> [1] 27.72589
> > llr(matrix(c(5,1995,0,100000), nrow=2))
> [1] 39.33052
> > llr(matrix(c(1000,1995,1000,100000), nrow=2))
> [1] 4730.737
> > llr(matrix(c(1000,1000,1000,100000), nrow=2))
> [1] 5734.343
> > llr(matrix(c(1000,1000,1000,99000), nrow=2))
> [1] 5714.932
>
>
> On Mon, Aug 10, 2009 at 11:11 AM, Shashikant Kore <sh...@gmail.com>wrote:
>
>> Ted,
>>
>> Thank you for the elaborate explanation.
>>
>> I think I just hit the classic "this is left as an exercise to the
>> reader".  One last query (I hope):
>>
>> On your blog you have defined LLR as follows.
>>
>> >
>> > LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
>> > where H is Shannon's entropy, computed as the sum of (k_ij / sum(k)) log
>> (k_ij / sum(k))
>> >
>>
>> I am unable to follow this. Do you have any code to explain this? Or
>> elaboration for following example will also be equally great.
>>
>> > Also suppose that the corpus has 100,000 documents in it we have (k11,
>> k12, k21, k22, llr) as
>> >
>> > 5, 1995, 100, 97900, 2.96
>> >
>>
>> Thanks,
>>
>> --shashi
>>
>> On Mon, Aug 10, 2009 at 10:32 PM, Ted Dunning<te...@gmail.com>
>> wrote:
>> > On Mon, Aug 10, 2009 at 6:51 AM, Shashikant Kore <shashikant@gmail.com
>> >wrote:
>> >
>> >> I have little difficulty in understanding LLR for cluster labels.
>> >>
>> >
>> > Sorry about that.  I will try to be more clear.
>> >
>> >
>> >>  For a phrase, if
>> >> - in-cluster doc frequency is  inDF
>> >> - out-of-cluster doc frequency is  outDF
>> >> - size of the cluster is clusterSize
>> >> - size of the corpus is corpusSize
>> >>
>> >
>> > Good.
>> >
>> >
>> >> how do I calculate the LLR?
>> >>
>> >
>> > Assuming that the corpus is a superset of the cluster, form the table
>> using:
>> >
>> >     k11 = inDF
>> >     k12 = clusterSize - inDF
>> >     k21 = outDF
>> >     k22 = corpusSize - clusterSize - outDF
>> >
>> > If the cluster is not a subset of the corpus, then k22 = corpusSize -
>> outDF
>> >
>> >
>> >>  I have difficulty in mapping these numbers to Event A & Event B that
>> >> you talked on your blog.
>> >>
>> >
>> > Event A is in-cluster, Event B is out-of-cluster.
>> >
>> >
>> >>  From the basic numbers, I could come up with inCluster percentage. But
>> >> that doesn't help much. For example,  let's say my cluster size is
>> >> 2000 documents and corpus size is 1000.  A phrase which occurs in the
>> >> cluster in 5 documents and doesn't appear outside cluster has
>> >> inCluster percentage of 100. Another phrase which occurs 1000 times in
>> >> the cluster and 1000 times outside cluster. This phrase has inCluster
>> >> percentage of 50. Intuitively, this is a better candidate for label
>> >> that previous one. But, I am unable to figure out how these numbers
>> >> need to be normalized.
>> >>
>> >
>> > First, the corpus size should normally be much larger than your cluster
>> > size.  With document categorization, the ratio is enormous; with
>> > clustering it should still be at least one order of magnitude larger.
>> >
>> > So let's take your example and add a case where in-cluster = 5 and
>> > out-cluster = 5, another where in-cluster = 5, out-cluster = 100, and
>> > another where in-cluster = 1000, out-cluster = 45,000.
>> >
>> > Also suppose that the corpus has 100,000 documents in it; then we have
>> > (k11, k12, k21, k22, llr) as
>> >
>> > 5, 1995, 0, 98000, 39.33
>> > 5, 1995, 5, 97995, 25.47
>> > 5, 1995, 100, 97900, 2.96
>> > 1000, 1000, 1000, 97000, 5714.93
>> > 1000, 1000, 45000, 48000, 2.04
>> >
>> > According to llr, your original case of 5 in and 0 out is definitely
>> > worthy of mention and the case with 5 in and 5 out is somewhat less
>> > interesting.  The case with 5 in and 100 out is not very interesting,
>> > nor is the case with 1000 in and 45000 out.  Your case with 1000 in
>> > and 1000 out is the most exciting of them all.
>> >
>> > The most useful way to think of this is as the percentage of in-cluster
>> > documents that have the feature (term) versus the percentage out,
>> > keeping in mind that both percentages are uncertain since we have only
>> > a sample of all possible documents.  Where these percentages are very
>> > different and where that difference is unlikely to be due to accidental
>> > variation, then LLR will be large.
>> >
>> >
>> > I don't know if I mentioned this on the blog, but it is often nice to
>> > rescale these scores by taking the square root and adding a sign
>> > according to whether k11/(k11+k12) > k21/(k21+k22).  This gives you a
>> > number that has the same scale as a normal distribution so lots more
>> > people will have good intuitions about what is large and what is not.
>> >
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
I have some R code handy.  It should be easy to translate to Mahoutish Java.

*entropy <- function(x) {-sum(x*log((x+(x==0))/sum(x)))}*

The only tricky bit here is that sum will sum up a vector or a matrix and
(x==0) returns a matrix or vector of booleans that is the same shape as x.
Adding a boolean means that all instances where x!=0 are left alone while
all zeros are incremented by one.  This keeps the log finite (log(1/sum(x))
rather than -Inf), and since we are (elementwise) multiplying by x anyway,
those entries contribute zero.  Elements of x cannot be < 0 by definition
(they are counts).  The net effect is that 0 log 0 = 0 as desired.  If
sum(x) = 1, then this function will give you Shannon entropy.

and then:

*llr <- function(x) { 2*(entropy(rowSums(x)) + entropy(colSums(x)) -
entropy(x))}
*

The functions rowSums and colSums do pretty much what you would expect.
Given a matrix, they add up the elements in each row or column (to get a
vector).  I moved around some minus signs to make me happy, btw.  All that
matters in the end is that entropy and llr both give positive values.

Some test cases:

*> entropy(c(1,1))
[1] 1.386294
> llr(matrix(c(1,0,0,1), nrow=2))
[1] 2.772589
> llr(matrix(c(10,0,0,10), nrow=2))
[1] 27.72589
> llr(matrix(c(5,1995,0,100000), nrow=2))
[1] 39.33052
> llr(matrix(c(1000,1995,1000,100000), nrow=2))
[1] 4730.737
> llr(matrix(c(1000,1000,1000,100000), nrow=2))
[1] 5734.343
> llr(matrix(c(1000,1000,1000,99000), nrow=2))
[1] 5714.932
*
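
Since translating to Mahoutish Java came up, here is a rough, hedged sketch
of what that translation might look like.  The class and method names are
illustrative only, not existing Mahout API:

public class Llr {

  // Entropy-style term: -sum(k * log(k / total)), with the convention
  // 0 log 0 = 0, mirroring the R entropy() above.
  static double entropy(double... counts) {
    double total = 0.0;
    for (double k : counts) {
      total += k;
    }
    double h = 0.0;
    for (double k : counts) {
      if (k > 0) {
        h -= k * Math.log(k / total);
      }
    }
    return h;
  }

  // LLR for a 2x2 table (k11, k12, k21, k22), mirroring the R llr():
  // 2 * (rowEntropy + colEntropy - cellEntropy).
  static double llr(double k11, double k12, double k21, double k22) {
    double rows = entropy(k11 + k12, k21 + k22);
    double cols = entropy(k11 + k21, k12 + k22);
    double cells = entropy(k11, k12, k21, k22);
    return 2 * (rows + cols - cells);
  }

  public static void main(String[] args) {
    System.out.println(llr(1, 0, 0, 1));         // ~2.772589
    System.out.println(llr(10, 0, 0, 10));       // ~27.72589
    System.out.println(llr(5, 1995, 0, 100000)); // ~39.33052
  }
}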


On Mon, Aug 10, 2009 at 11:11 AM, Shashikant Kore <sh...@gmail.com>wrote:

> Ted,
>
> Thank you for the elaborate explanation.
>
> I think I just hit the classic "this is left as an exercise to the
> reader." One last query (I hope).
>
> On your blog you have defined LLR as follows.
>
> >
> > LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
> > where H is Shannon's entropy, computed as the sum of
> > (k_ij / sum(k)) log (k_ij / sum(k))
> >
>
> I am unable to follow this. Do you have any code to explain this? An
> elaboration of the following example would be equally great.
>
> > Also suppose that the corpus has 100,000 documents in it; then we have
> > (k11, k12, k21, k22, llr) as
> >
> > 5, 1995, 100, 97900, 2.96
> >
>
> Thanks,
>
> --shashi
>
> On Mon, Aug 10, 2009 at 10:32 PM, Ted Dunning<te...@gmail.com>
> wrote:
> > On Mon, Aug 10, 2009 at 6:51 AM, Shashikant Kore <shashikant@gmail.com
> >wrote:
> >
> >> I have a little difficulty in understanding LLR for cluster labels.
> >>
> >
> > Sorry about that.  I will try to be more clear.
> >
> >
> >>  For a phrase, if
> >> - in-cluster doc frequency is  inDF
> >> - out-of-cluster doc frequency is  outDF
> >> - size of the cluster is clusterSize
> >> - size of the corpus is corpusSize
> >>
> >
> > Good.
> >
> >
> >> how do I calculate the LLR?
> >>
> >
> > Assuming that the corpus is a superset of the cluster, form the
> > table using:
> >
> >     k11 = inDF
> >     k12 = clusterSize - inDF
> >     k21 = outDF
> >     k22 = corpusSize - clusterSize - outDF
> >
> > If the cluster is not a subset of the corpus, then
> > k22 = corpusSize - outDF
> >
> >
> >>  I have difficulty in mapping these numbers to Event A & Event B that
> >> you talked about on your blog.
> >>
> >
> > Event A is in-cluster, Event B is out-of-cluster.
> >
> >
> >> From the basic numbers, I could come up with an inCluster percentage,
> >> but that doesn't help much. For example, let's say my cluster size is
> >> 2000 documents and corpus size is 100,000. A phrase which occurs in
> >> the cluster in 5 documents and doesn't appear outside the cluster has
> >> an inCluster percentage of 100. Another phrase occurs in 1000
> >> documents in the cluster and 1000 outside it, giving an inCluster
> >> percentage of 50. Intuitively, the second is a better candidate for a
> >> label than the first, but I am unable to figure out how these numbers
> >> need to be normalized.
> >>
> >
> > First, the corpus size should normally be much larger than your cluster
> > size.  With document categorization, the ratio is enormous; with
> > clustering it should still be at least one order of magnitude larger.
> >
> > So let's take your example and add a case where in-cluster = 5 and
> > out-cluster = 5, another where in-cluster = 5, out-cluster = 100, and
> > another where in-cluster = 1000, out-cluster = 45,000.
> >
> > Also suppose that the corpus has 100,000 documents in it; then we have
> > (k11, k12, k21, k22, llr) as
> >
> > 5, 1995, 0, 98000, 39.33
> > 5, 1995, 5, 97995, 25.47
> > 5, 1995, 100, 97900, 2.96
> > 1000, 1000, 1000, 97000, 5714.93
> > 1000, 1000, 45000, 48000, 2.04
> >
> > According to llr, your original case of 5 in and 0 out is definitely
> > worthy of mention and the case with 5 in and 5 out is somewhat less
> > interesting.  The case with 5 in and 100 out is not very interesting,
> > nor is the case with 1000 in and 45000 out.  Your case with 1000 in
> > and 1000 out is the most exciting of them all.
> >
> > The most useful way to think of this is as the percentage of in-cluster
> > documents that have the feature (term) versus the percentage out,
> > keeping in mind that both percentages are uncertain since we have only
> > a sample of all possible documents.  Where these percentages are very
> > different and where that difference is unlikely to be due to accidental
> > variation, then LLR will be large.
> >
> >
> > I don't know if I mentioned this on the blog, but it is often nice to
> > rescale these scores by taking the square root and adding a sign
> > according to whether k11/(k11+k12) > k21/(k21+k22).  This gives you a
> > number that has the same scale as a normal distribution so lots more
> > people will have good intuitions about what is large and what is not.
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Methods for Naming Clusters

Posted by Shashikant Kore <sh...@gmail.com>.
Ted,

Thank you for the elaborate explanation.

I think I just hit the classic "this is left as an exercise to the
reader." One last query (I hope).

On your blog you have defined LLR as follows.

>
> LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
> where H is Shannon's entropy, computed as the sum of (k_ij / sum(k)) log (k_ij / sum(k))
>

I am unable to follow this. Do you have any code to explain this? An
elaboration of the following example would be equally great.

> Also suppose that the corpus has 100,000 documents in it; then we have (k11, k12, k21, k22, llr) as
>
> 5, 1995, 100, 97900, 2.96
>

Thanks,

--shashi

On Mon, Aug 10, 2009 at 10:32 PM, Ted Dunning<te...@gmail.com> wrote:
> On Mon, Aug 10, 2009 at 6:51 AM, Shashikant Kore <sh...@gmail.com>wrote:
>
>> I have a little difficulty in understanding LLR for cluster labels.
>>
>
> Sorry about that.  I will try to be more clear.
>
>
>>  For a phrase, if
>> - in-cluster doc frequency is  inDF
>> - out-of-cluster doc frequency is  outDF
>> - size of the cluster is clusterSize
>> - size of the corpus is corpusSize
>>
>
> Good.
>
>
>> how do I calculate the LLR?
>>
>
> Assuming that the corpus is a superset of the cluster, form the table using:
>
>     k11 = inDF
>     k12 = clusterSize - inDF
>     k21 = outDF
>     k22 = corpusSize - clusterSize - outDF
>
> If the cluster is not a subset of the corpus, then k22 = corpusSize - outDF
>
>
>>  I have difficulty in mapping these numbers to Event A & Event B that
>> you talked about on your blog.
>>
>
> Event A is in-cluster, Event B is out-of-cluster.
>
>
>> From the basic numbers, I could come up with an inCluster percentage,
>> but that doesn't help much. For example, let's say my cluster size is
>> 2000 documents and corpus size is 100,000. A phrase which occurs in the
>> cluster in 5 documents and doesn't appear outside the cluster has an
>> inCluster percentage of 100. Another phrase occurs in 1000 documents in
>> the cluster and 1000 outside it, giving an inCluster percentage of 50.
>> Intuitively, the second is a better candidate for a label than the
>> first, but I am unable to figure out how these numbers need to be
>> normalized.
>>
>
> First, the corpus size should normally be much larger than your cluster
> size.  With document categorization, the ratio is enormous; with clustering
> it should still be at least one order of magnitude larger.
>
> So let's take your example and add a case where in-cluster = 5 and
> out-cluster = 5, another where in-cluster = 5, out-cluster = 100, and
> another where in-cluster = 1000, out-cluster = 45,000.
>
> Also suppose that the corpus has 100,000 documents in it; then we have
> (k11, k12, k21, k22, llr) as
>
> 5, 1995, 0, 98000, 39.33
> 5, 1995, 5, 97995, 25.47
> 5, 1995, 100, 97900, 2.96
> 1000, 1000, 1000, 97000, 5714.93
> 1000, 1000, 45000, 48000, 2.04
>
> According to llr, your original case of 5 in and 0 out is definitely worthy
> of mention and the case with 5 in and 5 out is somewhat less interesting.
> The case with 5 in and 100 out is not very interesting, nor is the case with
> 1000 in and 45000 out.  Your case with 1000 in and 1000 out is the most
> exciting of them all.
>
> The most useful way to think of this is as the percentage of in-cluster
> documents that have the feature (term) versus the percentage out, keeping in
> mind that both percentages are uncertain since we have only a sample of all
> possible documents.  Where these percentages are very different and where
> that difference is unlikely to be due to accidental variation, then LLR will
> be large.
>
>
> I don't know if I mentioned this on the blog, but it is often nice to
> rescale these scores by taking the square root and adding a sign according
> to whether k11/(k11+k12) > k21/(k21+k22).  This gives you a number that has
> the same scale as a normal distribution so lots more people will have good
> intuitions about what is large and what is not.
>

Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Aug 10, 2009 at 6:51 AM, Shashikant Kore <sh...@gmail.com>wrote:

> I have a little difficulty in understanding LLR for cluster labels.
>

Sorry about that.  I will try to be more clear.


>  For a phrase, if
> - in-cluster doc frequency is  inDF
> - out-of-cluster doc frequency is  outDF
> - size of the cluster is clusterSize
> - size of the corpus is corpusSize
>

Good.


> how do I calculate the LLR?
>

Assuming that the corpus is a superset of the cluster, form the table using:

     k11 = inDF
     k12 = clusterSize - inDF
     k21 = outDF
     k22 = corpusSize - clusterSize - outDF

If the cluster is not a subset of the corpus, then k22 = corpusSize - outDF
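
In code form, using an llr(k11, k12, k21, k22) helper along the lines of the
Java sketch earlier in the thread (the variable names simply follow the
definitions above; nothing here is existing API):

// All counts as doubles so nothing is lost to integer division later.
double k11 = inDF;
double k12 = clusterSize - inDF;
double k21 = outDF;
// Assumes the corpus is a superset of the cluster.
double k22 = corpusSize - clusterSize - outDF;
double score = Llr.llr(k11, k12, k21, k22);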


>  I have difficulty in mapping these numbers to Event A & Event B that
> you talked about on your blog.
>

Event A is in-cluster, Event B is out-of-cluster.


> From the basic numbers, I could come up with an inCluster percentage,
> but that doesn't help much. For example, let's say my cluster size is
> 2000 documents and corpus size is 100,000. A phrase which occurs in the
> cluster in 5 documents and doesn't appear outside the cluster has an
> inCluster percentage of 100. Another phrase occurs in 1000 documents in
> the cluster and 1000 outside it, giving an inCluster percentage of 50.
> Intuitively, the second is a better candidate for a label than the
> first, but I am unable to figure out how these numbers need to be
> normalized.
>

First, the corpus size should normally be much larger than your cluster
size.  With document categorization, the ratio is enormous; with clustering
it should still be at least one order of magnitude larger.

So let's take your example and add a case where in-cluster = 5 and
out-cluster = 5, another where in-cluster = 5, out-cluster = 100, and
another where in-cluster = 1000, out-cluster = 45,000.

Also suppose that the corpus has 100,000 documents in it; then we have
(k11, k12, k21, k22, llr) as

5, 1995, 0, 98000, 39.33
5, 1995, 5, 97995, 25.47
5, 1995, 100, 97900, 2.96
1000, 1000, 1000, 97000, 5714.93
1000, 1000, 45000, 48000, 2.04

According to llr, your original case of 5 in and 0 out is definitely worthy
of mention and the case with 5 in and 5 out is somewhat less interesting.
The case with 5 in and 100 out is not very interesting, nor is the case with
1000 in and 45000 out.  Your case with 1000 in and 1000 out is the most
exciting of them all.

The most useful way to think of this is as the percentage of in-cluster
documents that have the feature (term) versus the percentage out, keeping in
mind that both percentages are uncertain since we have only a sample of all
possible documents.  Where these percentages are very different and where
that difference is unlikely to be due to accidental variation, then LLR will
be large.


I don't know if I mentioned this on the blog, but it is often nice to
rescale these scores by taking the square root and adding a sign according
to whether k11/(k11+k12) > k21/(k21+k22).  This gives you a number that has
the same scale as a normal distribution so lots more people will have good
intuitions about what is large and what is not.
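
As a hedged sketch, again assuming the llr helper from the earlier Java
sketch rather than any existing API, that rescaling looks like:

static double signedRootLlr(double k11, double k12, double k21, double k22) {
  double root = Math.sqrt(Llr.llr(k11, k12, k21, k22));
  // Positive when the term is over-represented in the cluster,
  // negative when it is under-represented.
  return k11 / (k11 + k12) > k21 / (k21 + k22) ? root : -root;
}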

Re: Methods for Naming Clusters

Posted by Shashikant Kore <sh...@gmail.com>.
On Thu, Aug 6, 2009 at 2:57 AM, Ted Dunning<te...@gmail.com> wrote:
>
> Generally, I find it better to use a more nuanced approach to try to get
> something like best common substring, or best over-represented sub-strings.
> A nice way to do that is the log-likelihood ratio test that I use for
> everything under the sun.  This would consider in-cluster and out-of-cluster
> as two classes and would consider the frequency of each possible term or
> phrase in these two classes.  This will give you words and phrases that are
> anomalously common in your cluster and relatively rare outside it.  You may
> want to use document frequency for these comparisons since you can often get
> those frequencies from, for example, a Lucene index more easily than the
> actual number of occurrences.
>

Hi Ted,

I have a little difficulty in understanding LLR for cluster labels.

For a phrase, if
- in-cluster doc frequency is  inDF
- out-of-cluster doc frequency is  outDF
- size of the cluster is clusterSize
- size of the corpus is corpusSize

how do I calculate the LLR?

I have difficulty in mapping these numbers to Event A & Event B that
you talked about on your blog.

From the basic numbers, I could come up with an inCluster percentage,
but that doesn't help much. For example, let's say my cluster size is
2000 documents and corpus size is 100,000. A phrase which occurs in the
cluster in 5 documents and doesn't appear outside the cluster has an
inCluster percentage of 100. Another phrase occurs in 1000 documents in
the cluster and 1000 outside it, giving an inCluster percentage of 50.
Intuitively, the second is a better candidate for a label than the
first, but I am unable to figure out how these numbers need to be
normalized.

It would be great if you could shed some light on this.

Thanks,

--shashi

Re: Methods for Naming Clusters

Posted by Paul Ingles <pa...@oobaloo.co.uk>.
Great, thanks for the pointers Ted, will take a look when I'm back in  
the office tomorrow!


On 5 Aug 2009, at 22:27, Ted Dunning wrote:

> LCS is often hilariously bad for various kinds of documents because it
> tends to pick up boiler-plate (my favorite example is "Staff writer of
> the Wall Street Journal" ... a 7 word phrase shared with a huge fraction
> of the documents I was working with at the time).
>
> Generally, I find it better to use a more nuanced approach to try to get
> something like best common substring, or best over-represented
> sub-strings.  A nice way to do that is the log-likelihood ratio test
> that I use for everything under the sun.  This would consider in-cluster
> and out-of-cluster as two classes and would consider the frequency of
> each possible term or phrase in these two classes.  This will give you
> words and phrases that are anomalously common in your cluster and
> relatively rare outside it.  You may want to use document frequency for
> these comparisons since you can often get those frequencies from, for
> example, a Lucene index more easily than the actual number of
> occurrences.
>
> See my blog about likelihood ratio tests
> <http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html>
> or my original paper on likelihood ratio tests
> <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.2186>
> or the section in Manning and Schuetze's book on collocations
> <http://books.google.com/books?id=YiFDxbEX3SUC&lpg=PA52&ots=vXuoqtjESI&dq=manning%20and%20schuetze%2C%20pdf&pg=PA172#v=onepage&q=dunning&f=false>
> for more details.
>
> Another approach would be to apply an LLR test to find concepts at the
> term or phrase level that seem over-represented in the cluster.  Random
> indexing or DCA or LDA would be reasonable ways to get conceptual level
> representations.  This would let you handle cases where you don't have
> exact matches on terminology.  Random indexing could be combined with
> sim-hashes to get very nice approximate conceptual matches.
>
> On Wed, Aug 5, 2009 at 12:38 PM, Paul Ingles <pa...@oobaloo.co.uk>  
> wrote:
>
>> Hi,
>>
>> As I've mentioned in the past, I'm working on clustering documents  
>> (albeit
>> relatively small ones). The cluster mechanism I've ended up with has
>> produced some pretty good results (at least for what I need to be  
>> able to
>> do). However, what I'd like to be able to do is find a way to  
>> automate the
>> naming of these groups.
>>
>> For example, if each document has a 6/7 word title, I'd like to  
>> produce
>> names that are somewhat logically ordered (that is they make  
>> grammatical
>> sense, this can probably be inferred by the frequency in the  
>> clusters: most
>> documents in a cluster should be well-formed) and share terms  
>> across the
>> majority of the titles.
>>
>> So far, I'm using a kind of hacked-together longest common substring
>> method:
>>
>> * Sort the titles within the cluster
>> * Compare every string against every other string, producing a LCS  
>> value
>> * Use the most common LCS
>>
>> As this is all relatively new ground for me, I was wondering  
>> whether there
>> were any better methods I could be using?
>>
>> Thanks,
>> Paul
>>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve


Re: Methods for Naming Clusters

Posted by Ted Dunning <te...@gmail.com>.
LCS is often hilariously bad for various kinds of documents because it tends
to pick up boiler-plate (my favorite example is "Staff writer of the Wall
Street Journal" ... a 7 word phrase shared with a huge fraction of the
documents I was working with at the time).

Generally, I find it better to use a more nuanced approach to try to get
something like best common substring, or best over-represented sub-strings.
A nice way to do that is the log-likelihood ratio test that I use for
everything under the sun.  This would consider in-cluster and out-of-cluster
as two classes and would consider the frequency of each possible term or
phrase in these two classes.  This will give you words and phrases that are
anomalously common in your cluster and relatively rare outside it.  You may
want to use document frequency for these comparisons since you can often get
those frequencies from, for example, a Lucene index more easily than the
actual number of occurrences.

See my blog about likelihood ratio tests
<http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html>
or my original paper on likelihood ratio tests
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.2186>
or the section in Manning and Schuetze's book on collocations
<http://books.google.com/books?id=YiFDxbEX3SUC&lpg=PA52&ots=vXuoqtjESI&dq=manning%20and%20schuetze%2C%20pdf&pg=PA172#v=onepage&q=dunning&f=false>
for more details.

Another approach would be to apply an LLR test to find concepts at the term
or phrase level that seem over-represented in the cluster.  Random indexing
or DCA or LDA would be reasonable ways to get conceptual level
representations.  This would let you handle cases where you don't have exact
matches on terminology.  Random indexing could be combined with sim-hashes
to get very nice approximate conceptual matches.
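
Putting the pieces of this thread together, a minimal sketch of the whole
labeling step might look like the following.  Everything here is
illustrative: Llr.llr is the 2x2 helper sketched elsewhere in the thread,
and the document-frequency maps are assumed inputs (they could be computed
from a Lucene index as suggested above):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterLabeler {
  // Rank candidate label phrases for one cluster by LLR over doc
  // frequencies, highest score first.
  static List<String> topLabels(Map<String, Integer> inClusterDF,
                                Map<String, Integer> corpusDF,
                                int clusterSize, int corpusSize, int n) {
    final Map<String, Double> score = new HashMap<>();
    for (Map.Entry<String, Integer> e : inClusterDF.entrySet()) {
      int inDF = e.getValue();
      // Doc frequency outside the cluster; corpus assumed a superset.
      int outDF = corpusDF.getOrDefault(e.getKey(), inDF) - inDF;
      score.put(e.getKey(), Llr.llr(inDF, clusterSize - inDF,
                                    outDF, corpusSize - clusterSize - outDF));
    }
    List<String> phrases = new ArrayList<>(score.keySet());
    phrases.sort((a, b) -> Double.compare(score.get(b), score.get(a)));
    return phrases.subList(0, Math.min(n, phrases.size()));
  }
}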

On Wed, Aug 5, 2009 at 12:38 PM, Paul Ingles <pa...@oobaloo.co.uk> wrote:

> Hi,
>
> As I've mentioned in the past, I'm working on clustering documents (albeit
> relatively small ones). The cluster mechanism I've ended up with has
> produced some pretty good results (at least for what I need to be able to
> do). However, what I'd like to be able to do is find a way to automate the
> naming of these groups.
>
> For example, if each document has a 6/7 word title, I'd like to produce
> names that are somewhat logically ordered (that is they make grammatical
> sense, this can probably be inferred by the frequency in the clusters: most
> documents in a cluster should be well-formed) and share terms across the
> majority of the titles.
>
> So far, I'm using a kind of hacked-together longest common substring
> method:
>
> * Sort the titles within the cluster
> * Compare every string against every other string, producing a LCS value
> * Use the most common LCS
>
> As this is all relatively new ground for me, I was wondering whether there
> were any better methods I could be using?
>
> Thanks,
> Paul
>



-- 
Ted Dunning, CTO
DeepDyve