Posted to user@mahout.apache.org by Benson Margulies <bi...@gmail.com> on 2009/12/18 00:30:37 UTC

Cluster text docs

Gang,

What's the state of the world on clustering a raft of textual
documents? Are all the pieces in place to start from a directory of
flat text files, push through Lucene to get the vectors, keep labels
on the vectors to point back to the files, and run, say, k-means?

I've got enough data here that skimming off the top few unigrams might
also be advisable.

I tried running this through Weka, and blew it out of virtual memory.

--benson

Re: Cluster text docs

Posted by Benson Margulies <bi...@gmail.com>.
I have a large pile of Hebrew news articles. I want to cluster them so
that I can select a disparate subset for initial tagging of a named
entity extraction model.

On Thu, Dec 17, 2009 at 10:34 PM, Drew Farris <dr...@gmail.com> wrote:
> Hi Benson,
>
> I've managed to go from a Lucene index to k-means output with a couple
> of smaller corpora: one around 500k items, about 1M total/100k unique
> tokens, and another with about half that number of items but with about
> 3M total/300k unique tokens (unigrams in some cases, and a mixture of
> unigrams and a limited set of bigrams in another). I ended up doing a
> number of runs with various settings, but somewhat arbitrarily I ended
> up filtering out terms that appeared in fewer than 8 items. I started
> with 1000 random centroids and ran 10 iterations. These runs were able
> to complete overnight on the minuscule 2-machine cluster I use for
> testing; they probably would have run without a problem without using a
> cluster at all. I never did go back and check to see if they had
> converged before running all 10 iterations.
>
> In each case I had the tools to inject item labels and tokens into a
> Lucene index already, so I did not have to use any Mahout-provided
> tools to set that up. It would be nice to provide a tool that did
> this, but what general-purpose tokenization pipeline should be used?
> In my case I was using a processor based on something developed
> internally for another project.
>
> Nevertheless, the Lucene index had a stored field for document labels
> and a tokenized, indexed field with term vectors stored, from which
> the tokens were extracted. Using o.a.m.utils.vectors.lucene.Driver, I
> was able to produce vectors suitable as a starting point for k-means.
>
> After running, k-means emits cluster and point data. Everything can be
> dumped using o.a.m.utils.clustering.ClusterDumper, which takes the
> clustering output and the dictionary file produced by the
> lucene.Driver and produces a text file containing what I believe to be
> a gson(?) representation of the SparseVector representing the centroid
> of the cluster (need to verify this), the top terms found in the
> cluster, and the labels of the items that fell into that cluster.
> I've opened up the ClusterDumper code and produced something
> that emits documents and their cluster assignments to support the
> investigation I'm doing.
>
> I have not done an exhaustive amount of validation on the output, but
> based on what I have done, the results look very promising.
>
> I've tried to run LDA on the same corpora, but haven't met with any
> success. I'm under the impression that I'm either doing something
> horribly wrong, or the scaling characteristics of the algorithm are
> quite different than k-means. I haven't managed to get my head around
> the algorithm or read the code enough to figure out what the problem
> could be at this point.
>
> What are the characteristics of the collection of documents you are
> attempting to cluster?
>
> Drew
>
> On Thu, Dec 17, 2009 at 6:30 PM, Benson Margulies <bi...@gmail.com> wrote:
>> Gang,
>>
>> What's the state of the world on clustering a raft of textual
>> documents? Are all the pieces in place to start from a directory of
>> flat text files, push through Lucene to get the vectors, keep labels
>> on the vectors to point back to the files, and run, say, k-means?
>>
>> I've got enough data here that skimming off the top few unigrams might
>> also be advisable.
>>
>> I tried running this through Weka, and blew it out of virtual memory.
>>
>> --benson
>>
>

Re: Cluster text docs

Posted by Drew Farris <dr...@gmail.com>.
Hi Benson,

I've managed to go from a Lucene index to k-means output with a couple
of smaller corpora: one around 500k items, about 1M total/100k unique
tokens, and another with about half that number of items but with about
3M total/300k unique tokens (unigrams in some cases, and a mixture of
unigrams and a limited set of bigrams in another). I ended up doing a
number of runs with various settings, but somewhat arbitrarily I ended
up filtering out terms that appeared in fewer than 8 items. I started
with 1000 random centroids and ran 10 iterations. These runs were able
to complete overnight on the minuscule 2-machine cluster I use for
testing; they probably would have run without a problem without using a
cluster at all. I never did go back and check to see if they had
converged before running all 10 iterations.

In each case I had the tools to inject item labels and tokens into a
Lucene index already, so I did not have to use any Mahout-provided
tools to set that up. It would be nice to provide a tool that did
this, but what general-purpose tokenization pipeline should be used?
In my case I was using a processor based on something developed
internally for another project.

Nevertheless, the Lucene index had a stored field for document labels
and a tokenized, indexed field with term vectors stored, from which
the tokens were extracted. Using o.a.m.utils.vectors.lucene.Driver, I
was able to produce vectors suitable as a starting point for k-means.
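
For reference, the invocation looks something like the sketch below. The
option names and the field names ("body", "label") are from memory and my
own setup, so treat them as placeholders and check the Driver's own usage
output before relying on them:

public class BuildVectors {
  public static void main(String[] args) throws Exception {
    // Sketch only: option names may differ between builds; run the Driver
    // with no arguments to see the options it actually accepts.
    org.apache.mahout.utils.vectors.lucene.Driver.main(new String[] {
        "--dir", "/path/to/lucene-index", // index with term vectors stored
        "--field", "body",                // the tokenized, indexed field
        "--idField", "label",             // stored field holding document labels
        "--dictOut", "/tmp/dict.txt",     // dictionary file, needed for dumping later
        "--output", "/tmp/vectors"        // SequenceFile of vectors for k-means
    });
  }
}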

After running, k-means emits cluster and point data. Everything can be
dumped using o.a.m.utils.clustering.ClusterDumper, which takes the
clustering output and the dictionary file produced by the
lucene.Driver and produces a text file containing what I believe to be
a gson(?) representation of the SparseVector representing the centroid
of the cluster (need to verify this), the top terms found in the
cluster, and the labels of the items that fell into that cluster.
I've opened up the ClusterDumper code and produced something
that emits documents and their cluster assignments to support the
investigation I'm doing.
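
The dump step then looks roughly like the sketch below; again the option
names and paths are from memory of my own runs, so verify them against
ClusterDumper's help output:

public class DumpClusters {
  public static void main(String[] args) throws Exception {
    // Sketch only: the paths are placeholders for wherever your run wrote
    // its output; point --seqFileDir at the final iteration's clusters-N
    // directory (N depends on when the run stopped).
    org.apache.mahout.utils.clustering.ClusterDumper.main(new String[] {
        "--seqFileDir", "/tmp/kmeans-output/clusters-9", // last iteration
        "--pointsDir", "/tmp/kmeans-output/points",      // point assignments
        "--dictionary", "/tmp/dict.txt",                 // from lucene.Driver
        "--output", "/tmp/clusters.txt"                  // human-readable dump
    });
  }
}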

I have not done an exhaustive amount of validation on the output, but
based on what I have done, the results look very promising.

I've tried to run LDA on the same corpora, but haven't met with any
success. I'm under the impression that I'm either doing something
horribly wrong, or the scaling characteristics of the algorithm are
quite different than k-means. I haven't managed to get my head around
the algorithm or read the code enough to figure out what the problem
could be at this point.

What are the characteristics of the collection of documents you are
attempting to cluster?

Drew

On Thu, Dec 17, 2009 at 6:30 PM, Benson Margulies <bi...@gmail.com> wrote:
> Gang,
>
> What's the state of the world on clustering a raft of textual
> documents? Are all the pieces in place to start from a directory of
> flat text files, push through Lucene to get the vectors, keep labels
> on the vectors to point back to the files, and run, say, k-means?
>
> I've got enough data here that skimming off the top few unigrams might
> also be advisable.
>
> I tried running this through Weka, and blew it out of virtual memory.
>
> --benson
>

Re: Cluster text docs

Posted by Drew Farris <dr...@gmail.com>.
On Sat, Dec 19, 2009 at 11:15 AM, Benson Margulies
<bi...@gmail.com> wrote:

> I've got the vectors built now; I need to find my way to actually
> running the k-means process.

Take a look at o.a.m.clustering.kmeans.ClusterDriver in core.
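
If I remember right, the driver exposes a static runJob entry point, and
something like the sketch below is the idea. In my checkout the class is
KMeansDriver; the argument order and the distance-measure class name are
from memory, so double-check the signature against your tree:

public class RunKMeans {
  public static void main(String[] args) throws Exception {
    // Sketch only: argument order and types may not match your checkout.
    // The seed clusters can come from canopy or from RandomSeedGenerator.
    org.apache.mahout.clustering.kmeans.KMeansDriver.runJob(
        "/tmp/vectors",           // input vectors from lucene.Driver
        "/tmp/initial-clusters",  // seed centroids
        "/tmp/kmeans-output",     // output directory
        "org.apache.mahout.common.distance.CosineDistanceMeasure", // placeholder FQCN
        0.001,                    // convergence delta
        10,                       // max iterations
        1);                       // number of reduce tasks
  }
}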

Re: Cluster text docs

Posted by Benson Margulies <bi...@gmail.com>.
OK, I'm a bit lost. I don't see any scripts in the examples directory.

There's the code using canopy that confused me months ago:-)

I've got the vectors built now; I need to find my way to actually
running the k-means process.


On Fri, Dec 18, 2009 at 6:56 AM, Isabel Drost <is...@apache.org> wrote:
> On Fri Shashikant Kore <sh...@gmail.com> wrote:
>
> Mahout doesn't have code to generate a Lucene index. We assume you have
> already created the index.
>
> You can have a look at the script in mahout-examples that demos the LDA
> implementation. There is also documentation available in the wiki:
>
> http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>
> Isabel
>

Re: Cluster text docs

Posted by Isabel Drost <is...@apache.org>.
On Fri Shashikant Kore <sh...@gmail.com> wrote:

> Mahout doesn't have code to generate a Lucene index. We assume you have
> already created the index.

You can have a look at the script in mahout-examples that demos the LDA
implementation. There is also documentation available in the wiki:

http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html

Isabel

Re: Cluster text docs

Posted by Drew Farris <dr...@gmail.com>.
Felix,

If you are doing clustering/topic mapping and you have the time, you
might first give it a try with stemmed unigrams and bigrams with
stopwords removed. The results of a simple approach such as this may
be sufficient for your needs. At the very least it provides a baseline
for further experimentation.
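
A Lucene analyzer chain for that baseline would look something like the
sketch below. Constructor signatures move around between Lucene releases
(stopword sets, Version arguments, package names), so take this as the
shape of the chain rather than drop-in code for any particular version:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Baseline: lowercase, remove stopwords, Porter-stem, then let
// ShingleFilter emit unigrams plus bigrams.
public class BaselineAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new StandardTokenizer(reader);
    ts = new LowerCaseFilter(ts);
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    ts = new PorterStemFilter(ts);
    ts = new ShingleFilter(ts, 2); // max shingle size 2 => unigrams + bigrams
    return ts;
  }
}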

Re: Cluster text docs

Posted by Benson Margulies <bi...@gmail.com>.
Now I've got some real vectors, and KMeans ran, but the cluster dumper
produces no output.

Does it eat the entire 'output' of the KMeansDriver?

I'm going to try the canopy process next.


On Sat, Dec 19, 2009 at 4:51 PM, Benson Margulies <bi...@gmail.com> wrote:
> I'm not doing too well here.
>
> I followed the instructions supplied here, with a current 0.3 dev tree
> and the correct hadoop, and got the following. I am definitely using my
> 0.3-SNAPSHOT Driver from my local build. This is assuming, of course,
> that I can just feed the vectors from
> org.apache.mahout.utils.vectors.lucene.Driver straight into the sample
> KMeans job. Maybe I need to feed them directly to the KMeans class,
> instead?
>
> Preparing Input
> 09/12/19 16:45:58 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=JobTracker, sessionId=
> 09/12/19 16:45:58 WARN mapred.JobClient: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the
> same.
> 09/12/19 16:45:59 INFO mapred.FileInputFormat: Total input paths to process : 1
> 09/12/19 16:45:59 INFO mapred.JobClient: Running job: job_local_0001
> 09/12/19 16:45:59 INFO mapred.FileInputFormat: Total input paths to process : 1
> 09/12/19 16:45:59 INFO mapred.MapTask: numReduceTasks: 0
> 09/12/19 16:45:59 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.NumberFormatException: For input string:
> "SEQ!org.apache.hadoop.io.LongWritable#org.apache.mahout.math.SparseVector*org.apache.hadoop.io.compress.DefaultCodec?A?d??"
>        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1224)
>        at java.lang.Double.valueOf(Double.java:475)
>        at org.apache.mahout.clustering.syntheticcontrol.canopy.InputMapper.map(InputMapper.java:51)
>        at org.apache.mahout.clustering.syntheticcontrol.canopy.InputMapper.map(InputMapper.java:36)
>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
>

Re: Cluster text docs

Posted by Benson Margulies <bi...@gmail.com>.
I'm not doing too well here.

I followed the instructions supplied here, with a current 0.3 dev tree
and the correct hadoop, and got the following. I am definitely using my
0.3-SNAPSHOT Driver from my local build. This is assuming, of course,
that I can just feed the vectors from
org.apache.mahout.utils.vectors.lucene.Driver straight into the sample
KMeans job. Maybe I need to feed them directly to the KMeans class,
instead?

Preparing Input
09/12/19 16:45:58 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
09/12/19 16:45:58 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the
same.
09/12/19 16:45:59 INFO mapred.FileInputFormat: Total input paths to process : 1
09/12/19 16:45:59 INFO mapred.JobClient: Running job: job_local_0001
09/12/19 16:45:59 INFO mapred.FileInputFormat: Total input paths to process : 1
09/12/19 16:45:59 INFO mapred.MapTask: numReduceTasks: 0
09/12/19 16:45:59 WARN mapred.LocalJobRunner: job_local_0001
java.lang.NumberFormatException: For input string:
"SEQ!org.apache.hadoop.io.LongWritable#org.apache.mahout.math.SparseVector*org.apache.hadoop.io.compress.DefaultCodec?A?d??"
	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1224)
	at java.lang.Double.valueOf(Double.java:475)
	at org.apache.mahout.clustering.syntheticcontrol.canopy.InputMapper.map(InputMapper.java:51)
	at org.apache.mahout.clustering.syntheticcontrol.canopy.InputMapper.map(InputMapper.java:36)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
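
For what it's worth, the string beginning "SEQ" in the exception is the
magic header of a Hadoop SequenceFile, so the syntheticcontrol InputMapper
appears to be parsing the binary vector file as lines of text. A quick
sanity check on what's actually in the file, sketched against the
key/value classes named in that header (adjust if your file differs):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.SparseVector;

// Sketch: iterate the SequenceFile and print each key and vector size.
public class VectorPeek {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    LongWritable key = new LongWritable();
    SparseVector value = new SparseVector();
    while (reader.next(key, value)) {
      System.out.println(key.get() + " -> " + value.size() + " dimensions");
    }
    reader.close();
  }
}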

Re: Cluster text docs

Posted by Felix Lange <fx...@googlemail.com>.
Hi,
Ted, I agree, sentences don't need to be grammatical for our purposes. My
intention was just to cut out noun-less phrases like "very good". I just
think that in general nouns say more about a topic than adjectives, so I
can leave the adjectives aside and make the feature vector a bit smaller.
@Drew: Yes, we actually did some testing on unigrams, and the results
weren't that bad.

Greetings
Felix



2009/12/19 Ted Dunning <te...@gmail.com>

> I think you are making a very big (and very wrong) assumption here.
>
> The non-grammaticality of these chunks does not generally adversely affect
> topic identification and can actually help it quite a bit.
>
> It is important to avoid "everybody knows" facts in your development at
> this
> point.  Even if everybody you talk to agrees that you don't even need to
> look at the data on this topic, you should still be suspicious of strong
> statements without data.
>
> On Sat, Dec 19, 2009 at 8:16 AM, Felix Lange <fx...@googlemail.com>
> wrote:
>
> > In particular, I have a question about building n-grams (subsets) from
> > noun-chunks. In the
> > power-sets of noun-chunks, we don't want to have subsets like "world's
> > first". That would surely spoil the clustering. Every subset should
> include
> > the grammatical core of the chunk, in this example, "aircraft".
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Cluster text docs

Posted by Ted Dunning <te...@gmail.com>.
I think you are making a very big (and very wrong) assumption here.

The non-grammaticality of these chunks does not generally adversely affect
topic identification and can actually help it quite a bit.

It is important to avoid "everybody knows" facts in your development at this
point.  Even if everybody you talk to agrees that you don't even need to
look at the data on this topic, you should still be suspicious of strong
statements without data.

On Sat, Dec 19, 2009 at 8:16 AM, Felix Lange <fx...@googlemail.com> wrote:

> In particular, I have a question about building n-grams (subsets) from
> noun-chunks. In the
> power-sets of noun-chunks, we don't want to have subsets like "world's
> first". That would surely spoil the clustering. Every subset should include
> the grammatical core of the chunk, in this example, "aircraft".
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Cluster text docs

Posted by Felix Lange <fx...@googlemail.com>.
Hi there,

I would like to add some thoughts about feature selection to this
discussion.
I'm working on the topic-clustering project at the TU Berlin that has
already been discussed on this mailing list (e.g.
http://mail-archives.apache.org/mod_mbox/lucene-mahout-user/200911.mbox/%3c5eb9b7ae0911060506n3b60dbfdmd34e41fc3db95c45@mail.gmail.com%3e).


Choosing the right feature-extraction and clustering algorithms is one
part of the story, but what should be the input to these algorithms in the
first place?

In a thread about preprocessing, my colleague Marc presented our
UIMA-based approach (
http://mail-archives.apache.org/mod_mbox/lucene-mahout-user/200911.mbox/%3c4B1180BD.3010208@marc-hofer.de%3e).
To sum this up, our pipeline implements the following preprocessing steps:
stripping of HTML tags > POS tagging and noun-group chunking, both via
wrappers for LingPipe annotators > stemming > stopword filtering. So we
could actually pass stemmed words without stopwords to the feature
extractor, but there are more effective (and probably less
data-intensive) possibilities.
Think about a sentence like this one:
(1) "The Me 262 is well known as the world's first fighter aircraft with a
jet engine".
If you do topic clustering, which words give a proper representation of
this sentence's topic? A good guess seems to be to take the noun phrases,
i.e. "The Me 262", "the world's first fighter aircraft", and "a jet
engine". Our noun chunker can easily achieve this, if we include number
words (262) in the set of grammatical categories occurring inside a noun
phrase. But if we stop here, we miss a generalization: a text with a chunk
"fighter aircrafts" probably has the same topic. But if we pass chunks
over as atomic features, we end up without a match, because this chunk is
not string-identical to "the world's first fighter aircraft". To make the
feature extractor/clusterer recognize the similarity, we do the following:
stemming (strips off the "s"), excluding determiners ("the") inside
chunks, and building from every chunk the power set that reflects the
grammatical structure. For "the world's first fighter aircraft", we end up
with the set {"world's first fighter aircraft", "first fighter aircraft",
"fighter aircraft", "aircraft"}, thus detecting the similarity to the
chunk "fighter aircrafts" (after stemming, that is). One could argue: why
take complete noun chunks in the first place, when they cannot be easily
matched with other phrases? This is because noun groups can carry meanings
that cannot be calculated from their parts. For example, a chunk "bag of
words" offers an excellent guess as to what an article is about (namely,
text processing). But that is not clear if you only look at the single
words "bag", "of", and "words".


As for the words that are not nouns or parts of noun chunks, many of them
can be left aside. For example, a word like "good" is not that specific
when it comes to topic clustering: "good" is an adjective, "aircraft" is a
noun. So a selection of topic-specific words can be done on the basis of
grammatical categories. That's what we have the POS tagger for.

Any comments on this approach are of course welcome. In particular, I
have a question about building n-grams (subsets) from noun chunks. In the
power sets of noun chunks, we don't want to have subsets like "world's
first". That would surely spoil the clustering. Every subset should
include the grammatical core of the chunk, in this example, "aircraft".
LingPipe's noun chunker is not able to do this, because it's based on a
sequential parse of the POS tags. If you have a chunk "wizard of
warcraft", the core of the chunk is "wizard", appearing on the outer left
of the chunk. In order to detect it, we need a deep parser. But this seems
to be much more costly. On an off-the-shelf dual-core computer with 4 gigs
of memory, we can do the preprocessing of this e-mail within half a
second. That would change dramatically if we used a deep parser. Or am I
wrong?

Greetings,
Felix

Re: Cluster text docs

Posted by Grant Ingersoll <gs...@apache.org>.
I don't know of any benchmarks other than what David H. has run. It would be good to get some set up (as with all the Mahout algorithms, actually).


On Dec 18, 2009, at 10:03 AM, Levy, Mark wrote:

> Hi Drew,
> 
> Below is a mail I sent to this list a while back.  Is this consistent with your experience?
> 
> Cheers,
> 
> Mark
> 
> 
> On Sep 23, 2009, at 6:05 AM, Levy, Mark wrote:
> 
>> I've started to experiment with LDA and am finding that it creates
>> only a single long-running map task for each iteration, which doesn't
>> scale well. The map is taking 20mins for 10k of my input
>> SparseVectors, and 5 hours for 100k (the vocabulary size also grows
>> when there are more vectors).
>>
>> Is this expected or am I doing something wrong? Are there any existing
>> performance benchmarks?
>> 
> 
> 
>> -----Original Message-----
>> From: Drew Farris [mailto:drew.farris@gmail.com]
>> Sent: 18 December 2009 13:59
>> To: mahout-user@lucene.apache.org
>> Subject: Re: Cluster text docs
>> 
>> Hi Shashi,
>> 
>> On Fri, Dec 18, 2009 at 1:36 AM, Shashikant Kore <sh...@gmail.com>
>> wrote:
>> 
>>> (.. cluster assignment is already there. Wonder why you had to redo
>>> it.)
>> 
>> Ahh, yes. I didn't have to re-do it, but I did want to learn the
>> internal structure of the data files and to point out that it was easy
>> enough to achieve. The code is quite straightforward.
>> 
>>> Drew, are you using the latest code? Overnight sounds too long.
>> 
>> That's good to know. This was a month or two ago before the
>> matrix/math stuff was rolled in. I'll collect exact times on the next
>> run I do.
>> 
>> Has anyone else run LDA outside of the canned Reuters example? I would
>> be interested to hear about corpus characteristics and processing
>> power required to successfully produce LDA clusters. I've had all
>> sorts of issues, but mostly related to Hadoop configuration nits
>> specific to my environment.
>> 
>> Drew

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Cluster text docs

Posted by Drew Farris <dr...@gmail.com>.
Just to follow up on this a bit:

The LDA example shipped with Mahout performs clustering on tokens from
the 21,578-document Reuters test set, but only looks at single-term
tokens that appear in more than 100 documents. This works out to be
2075 unique terms, which is considerably smaller than what I'd been
looking at, and smaller than what Mark describes below as well.
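
If it's useful for comparison, that document-frequency cut is easy to
reproduce against any index with the Lucene 2.x TermEnum API; "body" here
is just a placeholder field name:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

// Sketch: count the terms in one field that appear in more than 100
// documents, mirroring the filter the Reuters LDA example applies.
public class DocFreqCount {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(args[0]);
    TermEnum terms = reader.terms();
    int kept = 0;
    while (terms.next()) {
      Term t = terms.term();
      if ("body".equals(t.field()) && terms.docFreq() > 100) {
        kept++;
      }
    }
    System.out.println(kept + " terms appear in more than 100 documents");
    terms.close();
    reader.close();
  }
}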

On Fri, Dec 18, 2009 at 10:03 AM, Levy, Mark <ma...@last.fm> wrote:
> Hi Drew,
>
> Below is a mail I sent to this list a while back.  Is this consistent with your experience?
>
> Cheers,
>
> Mark
>
>
> On Sep 23, 2009, at 6:05 AM, Levy, Mark wrote:
>
>> I've started to experiment with LDA and am finding that it creates
>> only a single long-running map task for each iteration, which doesn't
>> scale well. The map is taking 20mins for 10k of my input
>> SparseVectors, and 5 hours for 100k (the vocabulary size also grows
>> when there are more vectors).
>>
>> Is this expected or am I doing something wrong? Are there any existing
>> performance benchmarks?
>>
>
>
>> -----Original Message-----
>> From: Drew Farris [mailto:drew.farris@gmail.com]
>> Sent: 18 December 2009 13:59
>> To: mahout-user@lucene.apache.org
>> Subject: Re: Cluster text docs
>>
>> Hi Shashi,
>>
>> On Fri, Dec 18, 2009 at 1:36 AM, Shashikant Kore <sh...@gmail.com>
>> wrote:
>>
>> > (.. cluster assignment is already there. Wonder why you had to redo
>> > it.)
>>
>> Ahh, yes. I didn't have to re-do it, but I did want to learn the
>> internal structure of the data files and to point out that it was easy
>> enough to achieve. The code is quite straightforward.
>>
>> > Drew, are you using the latest code? Overnight sounds too long.
>>
>> That's good to know. This was a month or two ago before the
>> matrix/math stuff was rolled in. I'll collect exact times on the next
>> run I do.
>>
>> Has anyone else run LDA outside of the canned Reuters example? I would
>> be interested to hear about corpus characteristics and processing
>> power required to successfully produce LDA clusters. I've had all
>> sorts of issues, but mostly related to Hadoop configuration nits
>> specific to my environment.
>>
>> Drew
>

RE: Cluster text docs

Posted by "Levy, Mark" <ma...@last.fm>.
Hi Drew,

Below is a mail I sent to this list a while back.  Is this consistent with your experience?

Cheers,

Mark


On Sep 23, 2009, at 6:05 AM, Levy, Mark wrote:

> I've started to experiment with LDA and am finding that it creates
> only a single long-running map task for each iteration, which doesn't
> scale well. The map is taking 20mins for 10k of my input
> SparseVectors, and 5 hours for 100k (the vocabulary size also grows
> when there are more vectors).
>
> Is this expected or am I doing something wrong? Are there any existing
> performance benchmarks?
>


> -----Original Message-----
> From: Drew Farris [mailto:drew.farris@gmail.com]
> Sent: 18 December 2009 13:59
> To: mahout-user@lucene.apache.org
> Subject: Re: Cluster text docs
> 
> Hi Shashi,
> 
> On Fri, Dec 18, 2009 at 1:36 AM, Shashikant Kore <sh...@gmail.com>
> wrote:
> 
> > (.. cluster assignment is already there. Wonder why you had to redo
> > it.)
> 
> Ahh, yes. I didn't have to re-do it, but I did want to learn the
> internal structure of the data files and to point out that it was easy
> enough to achieve. The code is quite straightforward.
> 
> > Drew, are you using the latest code? Overnight sounds too long.
> 
> That's good to know. This was a month or two ago before the
> matrix/math stuff was rolled in. I'll collect exact times on the next
> run I do.
> 
> Has anyone else run LDA outside of the canned Reuters example? I would
> be interested to hear about corpus characteristics and processing
> power required to successfully produce LDA clusters. I've had all
> sorts of issues, but mostly related to Hadoop configuration nits
> specific to my environment.
> 
> Drew

Re: Cluster text docs

Posted by Drew Farris <dr...@gmail.com>.
Hi Shashi,

On Fri, Dec 18, 2009 at 1:36 AM, Shashikant Kore <sh...@gmail.com> wrote:

> (.. cluster assignment is already there. Wonder why you had to redo
> it.)

Ahh, yes. I didn't have to re-do it, but I did want to learn the
internal structure of the data files and to point out that it was easy
enough to achieve. The code is quite straightforward.

> Drew, are you using the latest code? Overnight sounds too long.

That's good to know. This was a month or two ago before the
matrix/math stuff was rolled in. I'll collect exact times on the next
run I do.

Has anyone else run LDA outside of the canned Reuters example? I would
be interested to hear about corpus characteristics and processing
power required to successfully produce LDA clusters. I've had all
sorts of issues, but mostly related to Hadoop configuration nits
specific to my environment.

Drew

Re: Cluster text docs

Posted by Shashikant Kore <sh...@gmail.com>.
I have done it.

Mahout doesn't have code to generate a Lucene index. We assume you have
already created the index. You can create vectors from the Lucene index
easily, run k-means, and use ClusterDumper to get the clusters, the
documents in each cluster, and the top features from the centroid vector.
(Drew, cluster assignment is already there. Wonder why you had to redo
it.)

I have run k-means with a quarter million documents, each with 200
features (on average). I don't recall the total number of features in the
corpus, but I suspect that with the optimizations to distance
calculations, it doesn't affect performance. Also, during vector
generation, the terms which are too frequent or too rare are ignored.
I am able to run clustering on this set (100 random centroids, 10
iterations) in less than 30 minutes on a single host.
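
The seeding step for the random centroids looks roughly like the sketch
below; RandomSeedGenerator lives in the k-means package, though the exact
method signature may differ in your tree, and the paths are placeholders:

public class SeedClusters {
  public static void main(String[] args) throws Exception {
    // Sketch: pick 100 input vectors at random as initial k-means
    // centroids. Method name and signature are from memory, so check
    // RandomSeedGenerator in your checkout.
    org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(
        "/tmp/vectors",          // SequenceFile of input vectors
        "/tmp/initial-clusters", // where the seed clusters get written
        100);                    // number of random centroids
  }
}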

Drew, are you using the latest code? Overnight sounds too long.

--shashi

On Fri, Dec 18, 2009 at 5:00 AM, Benson Margulies <bi...@gmail.com> wrote:
> Gang,
>
> What's the state of the world on clustering a raft of textual
> documents? Are all the pieces in place to start from a directory of
> flat text files, push through Lucene to get the vectors, keep labels
> on the vectors to point back to the files, and run, say, k-means?
>
> I've got enough data here that skimming off the top few unigrams might
> also be advisable.
>
> I tried running this through Weka, and blew it out of virtual memory.
>
> --benson
>