Posted to user@mahout.apache.org by Ken Krugler <kk...@transpac.com> on 2010/02/11 03:04:17 UTC

Suggestions for best approach to classic document clustering

Hi all,

Given the code currently in Mahout (+ Lucene), is there a generally
accepted best approach for clustering of documents?

Assumptions are small document sets (e.g. a few thousand), with  
documents being representative data from web pages, all in English.

I've been fooling around with a few different combinations, e.g.
pre-processing the documents to extract keywords and using these for
clustering w/k-means, canopy, mean-shift canopy.

But before I have too much fun twiddling all the dials, it would be  
great to get input on good/bad options.

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: Suggestions for best approach to classic document clustering

Posted by Dawid Weiss <da...@gmail.com>.
> We originally tried Carrot2, and the results weren't bad. But one additional
> fact I should have mentioned is that I need to be able to use the resulting
> clusters to do subsequent classification of future documents. From what I
> could tell w/Carrot2 and perusing the mailing list, this isn't really
> feasible due to the lack of a centroid from the Carrot2 clusters.

It's because there are no "centroids" in the sense of, let's say,
k-means. If you define a cluster centroid as the average of its
documents' term vectors, you could compute them after the fact... but
it's not as clean as having a centroid that is a result of the
algorithm's internal workings, of course.
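
For illustration, a minimal sketch of that post-hoc averaging, assuming each
document is represented as a sparse term-weight map (the class and method
names here are just illustrative, not Carrot2 or Mahout API):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  /** Computes a post-hoc "centroid" for a cluster by averaging its
   *  documents' sparse term-weight vectors. */
  public class ClusterCentroid {
    public static Map<String, Double> centroid(List<Map<String, Double>> docVectors) {
      Map<String, Double> sum = new HashMap<String, Double>();
      if (docVectors.isEmpty()) {
        return sum;
      }
      for (Map<String, Double> doc : docVectors) {
        for (Map.Entry<String, Double> e : doc.entrySet()) {
          Double current = sum.get(e.getKey());
          sum.put(e.getKey(), (current == null ? 0.0 : current) + e.getValue());
        }
      }
      int n = docVectors.size();
      for (Map.Entry<String, Double> e : sum.entrySet()) {
        e.setValue(e.getValue() / n);   // average, not just the sum
      }
      return sum;
    }
  }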

As Ted mentioned, Carrot2 algorithms are designed to run in memory
with no harsh constraints on memory use (when there is a
speed-vs.-memory tradeoff, we usually choose speed), so the problem
size will be bounded by the maximum size of Java arrays, if nothing else.

Dawid

Re: Suggestions for best approach to classic document clustering

Posted by Drew Farris <dr...@gmail.com>.
Hi Ken,

On Wed, Feb 10, 2010 at 10:29 PM, Ken Krugler
<kk...@transpac.com> wrote:
>
> Is there any support currently in Mahout for generating tf-idf weighted
> vectors without creating a Lucene index? Just curious.
[..]
> I assume you'd use something like the Lucene ShingleAnalyzer to generate one
> and two word terms.

Support for these was (very!) recently checked into svn. Take a look
at the main entry points:

in mahout-examples: o.a.m.text.SequenceFilesFromDirectory
in mahout-utils: o.a.m.text.SparseVectorsFromSequenceFiles

This includes the ability to generate n-grams using Lucene's ShingleFilter.

The basic gist here is that SequenceFilesFromDirectory creates one or
more files of SequenceFile<DocId, DocumentText> from files found in a
directory, and SparseVectorsFromSequenceFiles does the tokenization
(Lucene StandardAnalyzer by default, but this is configurable), n-gram
generation, dictionary generation and document vectorization
(including tf-idf calculation). The output from each step is preserved
as a SequenceFile, which can be inspected using SequenceFileDumper.
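
If you'd rather poke at the output programmatically than via
SequenceFileDumper, here is a minimal sketch using the plain Hadoop
SequenceFile.Reader API (the Text/VectorWritable key and value types are an
assumption on my part; check what the job actually wrote first):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.mahout.math.VectorWritable;

  public class DumpVectors {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path path = new Path(args[0]);               // e.g. the tf-idf vector output

      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
      Text key = new Text();                       // document id
      VectorWritable value = new VectorWritable(); // sparse document vector
      while (reader.next(key, value)) {
        System.out.println(key + " => " + value.get());
      }
      reader.close();
    }
  }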

The process of generating the SequenceFile<DocId, DocumentText> is
straightforward enough that it should be fairly trivial to pull
content from whatever format you have the data in.
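
For example, to write the SequenceFile<DocId, DocumentText> yourself from
whatever source you have, something like the following should work (a
minimal local-filesystem sketch; fetchDocuments() is a hypothetical
stand-in for your own content source):

  import java.util.LinkedHashMap;
  import java.util.Map;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class WriteDocumentSequenceFile {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path path = new Path("documents.seq");

      SequenceFile.Writer writer =
          SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
      try {
        for (Map.Entry<String, String> doc : fetchDocuments().entrySet()) {
          writer.append(new Text(doc.getKey()), new Text(doc.getValue()));
        }
      } finally {
        writer.close();
      }
    }

    // Hypothetical: replace with however you actually fetch docId -> text.
    private static Map<String, String> fetchDocuments() {
      Map<String, String> docs = new LinkedHashMap<String, String>();
      docs.put("doc-1", "Example web page text goes here ...");
      return docs;
    }
  }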

A more in-depth description on the wiki is forthcoming, and the usual
caveats apply considering this is brand-new code. It sounds like it
would meet your needs, and it would be great to have someone else work
with it a bit.

Drew

Re: Suggestions for best approach to classic document clustering

Posted by Ken Krugler <kk...@transpac.com>.
Hi Ted,

Thanks much for the useful response. A few comments/questions inline  
below...

On Feb 10, 2010, at 6:27pm, Ted Dunning wrote:

> I don't think that there is a single best approach.  Here in rough order
> are the approaches that I would try:
>
> 1) carrot2 may be able to handle a few thousand documents.  If so, you
> should be pretty well off because their clustering is typically pretty good
> and it shows even better than it is because it makes inspectability a
> priority.  carrot2 is not, however, designed to scale to very large
> collections according to Dawid.

We originally tried Carrot2, and the results weren't bad. But one  
additional fact I should have mentioned is that I need to be able to  
use the resulting clusters to do subsequent classification of future  
documents. From what I could tell w/Carrot2 and perusing the mailing  
list, this isn't really feasible due to the lack of a centroid from  
the Carrot2 clusters.

> 2) k-means on tf-idf weighted words and bigrams, especially with the
> k-means++ starting point that isn't available yet.  This should also be
> pretty good, but may show a bit worse than it really is because it doesn't
> use the tricks carrot uses.

Is there any support currently in Mahout for generating tf-idf  
weighted vectors without creating a Lucene index? Just curious.
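
For reference, the weighting itself is simple enough to compute by hand once
you have per-document term counts; a minimal sketch with no Lucene or Hadoop
involved (the class name is just illustrative):

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  /** Turns per-document term counts into tf-idf weighted sparse vectors. */
  public class TfIdf {
    public static List<Map<String, Double>> weight(List<Map<String, Integer>> termCounts) {
      int numDocs = termCounts.size();
      // document frequency: how many documents contain each term
      Map<String, Integer> df = new HashMap<String, Integer>();
      for (Map<String, Integer> doc : termCounts) {
        for (String term : doc.keySet()) {
          Integer seen = df.get(term);
          df.put(term, seen == null ? 1 : seen + 1);
        }
      }
      List<Map<String, Double>> weighted = new ArrayList<Map<String, Double>>();
      for (Map<String, Integer> doc : termCounts) {
        Map<String, Double> vec = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
          double tf = e.getValue();
          double idf = Math.log((double) numDocs / df.get(e.getKey()));
          vec.put(e.getKey(), tf * idf);
        }
        weighted.add(vec);
      }
      return weighted;
    }
  }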

> 3) k-means on SVD representations of documents.  Same comments as (2)
> except that this option depends on *two* pieces of unreleased code
> (k-means++ and SVD) instead of one.  At this size, you should be able to
> avoid using the mega-SVD that Jake just posted, but that will mean you need
> to write your own interlude between the document vectorizer and a normal
> in-memory SVD.  You may have some cooler results here in that documents
> that share no words might be seen as similar, but I expect that overall
> results for this should be very similar as for (2).

I assume you'd use something like the Lucene ShingleAnalyzer to  
generate one and two word terms.
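
Something along those lines; in Lucene the relevant classes are
ShingleAnalyzerWrapper and ShingleFilter. A minimal sketch against the
Lucene 2.9/3.0 TokenStream API that emits unigrams plus two-word shingles:

  import java.io.StringReader;

  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.shingle.ShingleFilter;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;
  import org.apache.lucene.util.Version;

  public class UnigramsAndBigrams {
    public static void main(String[] args) throws Exception {
      StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
      TokenStream stream = analyzer.tokenStream("text",
          new StringReader("classic document clustering with mahout"));
      // max shingle size 2: by default ShingleFilter also passes unigrams through
      stream = new ShingleFilter(stream, 2);
      TermAttribute term = stream.addAttribute(TermAttribute.class);
      while (stream.incrementToken()) {
        System.out.println(term.term());
      }
      stream.close();
    }
  }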

> 4) if you aren't happy by now, invoke plan B and come up with new ideas.
> These might include raw LDA, k-means on LDA document vectors and k-means
> with norms other than L_2 and L_1.
>
> All of this changes a bit if you have some labeled documents.  K-means
> should be pretty easy to extend to deal with that and it can dramatically
> improve results.

I've experimented with using just the results of the Zemanta API,
which generates keywords by (I believe) calculating similarity with
sets of Wikipedia pages that share a similar category.

But clustering these keyword-only vectors gave marginal results,  
mostly for cases where the Zemanta results were questionable.

What approach would you take to mix in keywords such as these with the
raw document data?

Thanks again!

-- Ken

> On Wed, Feb 10, 2010 at 6:04 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
>
>> Given the code currently in Mahout (+ Lucene), is there a generally
>> accepted best approach for clustering of documents?

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: Suggestions for best approach to classic document clustering

Posted by Ted Dunning <te...@gmail.com>.
I don't think that there is a single best approach.  Here in rough order are
the approaches that I would try:

1) carrot2 may be able to handle a few thousand documents.  If so, you
should be pretty well off because their clustering is typically pretty good
and it shows even better than it is because it makes inspectability a
priority.  carrot2 is not, however, designed to scale to very large
collections according to Dawid.

2) k-means on tf-idf weighted words and bigrams, especially with the
k-means++ starting point that isn't available yet (see the seeding sketch
after this list).  This should also be pretty good, but may show a bit worse
than it really is because it doesn't use the tricks carrot uses.

3) k-means on SVD representations of documents.  Same comments as (2) except
that this option depends on *two* pieces of unreleased code (k-means++ and
SVD) instead of one.  At this size, you should be able to avoid using the
mega-SVD that Jake just posted, but that will mean you need to write your
own interlude between the document vectorizer and a normal in-memory SVD.
You may have some cooler results here in that documents that share no words
might be seen as similar, but I expect that overall results for this should
be very similar as for (2).

4) if you aren't happy by now, invoke plan B and come up with new ideas.
These might include raw LDA, k-means on LDA document vectors and k-means
with norms other than L_2 and L_1.
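
On the k-means++ seeding mentioned in (2): a minimal self-contained sketch of
just the seeding step over dense document vectors (plain Java, not Mahout
API). Each new seed is drawn with probability proportional to its squared
distance from the nearest seed chosen so far:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Random;

  public class KMeansPlusPlusSeeds {
    public static List<double[]> choose(double[][] docs, int k, Random rnd) {
      List<double[]> seeds = new ArrayList<double[]>();
      seeds.add(docs[rnd.nextInt(docs.length)]);   // first seed: uniform at random
      double[] d2 = new double[docs.length];       // squared distance to nearest seed
      while (seeds.size() < k) {
        double total = 0.0;
        for (int i = 0; i < docs.length; i++) {
          d2[i] = nearestSquaredDistance(docs[i], seeds);
          total += d2[i];
        }
        // sample the next seed proportionally to d2
        double r = rnd.nextDouble() * total;
        int chosen = docs.length - 1;
        for (int i = 0; i < docs.length; i++) {
          r -= d2[i];
          if (r <= 0.0) { chosen = i; break; }
        }
        seeds.add(docs[chosen]);
      }
      return seeds;
    }

    private static double nearestSquaredDistance(double[] x, List<double[]> seeds) {
      double best = Double.MAX_VALUE;
      for (double[] s : seeds) {
        double d = 0.0;
        for (int j = 0; j < x.length; j++) {
          double diff = x[j] - s[j];
          d += diff * diff;
        }
        best = Math.min(best, d);
      }
      return best;
    }
  }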

All of this changes a bit if you have some labeled documents.  K-means
should be pretty easy to extend to deal with that and it can dramatically
improve results.
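
One simple way to fold labels in, as a sketch (plain Java over dense
vectors; names are illustrative): seed one centroid per label from the mean
of that label's documents, then run ordinary k-means from those seeds so
unlabeled documents get pulled toward label-shaped clusters.

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  /** One centroid per label, taken as the mean of that label's documents. */
  public class LabeledSeeds {
    public static Map<String, double[]> fromLabels(Map<String, List<double[]>> docsByLabel,
                                                   int dimensions) {
      Map<String, double[]> seeds = new HashMap<String, double[]>();
      for (Map.Entry<String, List<double[]>> e : docsByLabel.entrySet()) {
        double[] mean = new double[dimensions];
        for (double[] doc : e.getValue()) {
          for (int j = 0; j < dimensions; j++) {
            mean[j] += doc[j];
          }
        }
        for (int j = 0; j < dimensions; j++) {
          mean[j] /= e.getValue().size();
        }
        seeds.put(e.getKey(), mean);
      }
      return seeds;
    }
  }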


On Wed, Feb 10, 2010 at 6:04 PM, Ken Krugler <kk...@transpac.com> wrote:

> Given the code currently in Mahout (+ Lucene), is there a generally accepted
> best approach for clustering of documents?




-- 
Ted Dunning, CTO
DeepDyve