Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/07/14 15:41:09 UTC

Re: Validating clustering output

Ted,

On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:

> A principled approach to cluster evaluation is to measure how well the
> cluster membership captures the structure of unseen data.  A natural measure
> for this is to measure how much of the entropy of the data is captured by
> cluster membership.  For k-means and its natural L_2 metric, the natural
> cluster quality metric is the squared distance from the nearest centroid
> adjusted by the log_2 of the number of clusters.  This can be compared to
> the squared magnitude of the original data or the squared deviation from the
> centroid for all of the data.  The idea is that you are changing the
> representation of the data by allocating some of the bits in your original
> representation to represent which cluster each point is in.  If those bits
> aren't made up by the residue being small then your clustering is making a
> bad trade-off.
>
> In the past, I have used other more heuristic measures as well.  One of the
> key characteristics that I would like to see out of a clustering is a degree
> of stability.  Thus, I look at the fractions of points that are assigned to
> each cluster or the distribution of distances from the cluster centroid.
> These values should be relatively stable when applied to held-out data.
>
> For text, you can actually compute perplexity which measures how well
> cluster membership predicts what words are used.  This is nice because you
> don't have to worry about the entropy of real valued numbers.
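
For the k-means case quoted above, the individual quantities are easy to compute. What follows is only a minimal sketch, not a formula from this thread: the exact form of the log_2 adjustment is not spelled out here, so the code just reports the pieces side by side. The function names, the sample points, and the use of the held-out mean as "the centroid for all of the data" are assumptions made for illustration.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean (L_2) distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans_quality(held_out, centroids):
    """Report the cost (bits for cluster membership) and what it buys (residual)."""
    n = len(held_out)
    # Mean squared distance from each held-out point to its nearest centroid.
    residual = sum(min(sq_dist(p, c) for c in centroids) for p in held_out) / n
    # Mean squared deviation from the single overall centroid (assumed baseline).
    center = mean_point(held_out)
    baseline = sum(sq_dist(p, center) for p in held_out) / n
    return {
        "bits_per_point": math.log2(len(centroids)),
        "residual": residual,
        "baseline": baseline,
        "residual_fraction": residual / baseline,
    }

# Hypothetical centroids (fitted elsewhere) and hypothetical held-out points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
held_out = [(0.0, 1.0), (1.0, 0.0), (10.0, 9.0), (9.0, 10.0)]
report = kmeans_quality(held_out, centroids)
print(report)
```

On this toy data the clustering spends 1 bit per point on membership and leaves roughly 2.4% of the baseline squared deviation as residual, which is the kind of trade-off the quoted paragraph describes.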

Do you have any references on any of the above approaches?

Thanks,
Grant

Re: Validating clustering output

Posted by Benson Margulies <bi...@gmail.com>.
On Tue, Jul 28, 2009 at 6:55 AM, Grant Ingersoll<gs...@apache.org> wrote:
>
> On Jul 28, 2009, at 12:48 AM, Ted Dunning wrote:
>
>>
>> I owe the IBM team my interest in statistical approaches to AI and symbolic
>> sequences.  It was on a visit to IBM in 1990 or so that Stephen (or Vincent)
>> dP mentioned off-handedly to me that mutual information was "trivially known
>> to be chi-squared distributed asymptotically".
>
> I love statements like these!  Takes me back to the good old Math days of
> "We'll leave it as an exercise to the reader" or proofs that start off by
> saying "It is trivial to prove ..., so we'll proceed to the main part of the
> proof" and, as a 20 year old Math student you spend the next day beating
> your head against the wall because it is anything but trivial to you!

And, indeed, the paper that started this thread is a shining example
of that sort of thing from the point of view of actual programming.
The 'description' of how to get from the obvious O(n^5) algorithm to
something usable is largely notable for what it does not say.

>
> -Grant
>

Re: Validating clustering output

Posted by Ted Dunning <te...@gmail.com>.
To be fair, it was a trivial result.  If you start from some very deep
theorems.  :-)

On Tue, Jul 28, 2009 at 3:55 AM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Jul 28, 2009, at 12:48 AM, Ted Dunning wrote:
>
>
>> I owe the IBM team my interest in statistical approaches to AI and symbolic
>> sequences.  It was on a visit to IBM in 1990 or so that Stephen (or Vincent)
>> dP mentioned off-handedly to me that mutual information was "trivially known
>> to be chi-squared distributed asymptotically".
>>
>
> I love statements like these!  Takes me back to the good old Math days of
> "We'll leave it as an exercise to the reader" or proofs that start off by
> saying "It is trivial to prove ..., so we'll proceed to the main part of the
> proof" and, as a 20 year old Math student you spend the next day beating
> your head against the wall because it is anything but trivial to you!
>
> -Grant
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Validating clustering output

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 28, 2009, at 12:48 AM, Ted Dunning wrote:

>
> I owe the IBM team my interest in statistical approaches to AI and symbolic
> sequences.  It was on a visit to IBM in 1990 or so that Stephen (or Vincent)
> dP mentioned off-handedly to me that mutual information was "trivially known
> to be chi-squared distributed asymptotically".

I love statements like these!  Takes me back to the good old Math days  
of "We'll leave it as an exercise to the reader" or proofs that start  
off by saying "It is trivial to prove ..., so we'll proceed to the  
main part of the proof" and, as a 20 year old Math student you spend  
the next day beating your head against the wall because it is anything  
but trivial to you!

-Grant

Re: Validating clustering output

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Jul 27, 2009 at 6:51 PM, Benson Margulies <bi...@gmail.com> wrote:

> [brown and mercer did hard stuff] Of course, you aren't proposing that,
> just
> recommending the bigram entropy metric or something like it.
>

Peter Brown and Bob Mercer were very sharp dudes, and when they did this work
it was 100 times more amazing than it is now.  They had the advantage of
working for a company that understood that the resources that you give
researchers now should be 20 times more than you would expect a user to have
in 5 years, but even so, their achievements were quite something.

Frankly, that record of achievement leads back beyond them to Fred Jelinek,
Lalit Bahl and Salim Roukos and all the other early guys who worked on
speech back then.  That work (along with the BBN team under Jim and Janet
Baker) gave us the entire framework of HMMs and entropy-based evaluation
that is core to speech systems today.  It leads forward to some of the
really fabulous work that the della Pietra brothers did as well.

I owe the IBM team my interest in statistical approaches to AI and symbolic
sequences.  It was on a visit to IBM in 1990 or so that Stephen (or Vincent)
dP mentioned off-handedly to me that mutual information was "trivially known
to be chi-squared distributed asymptotically".  That was news to me and
formed the basis of a LOT of the work that I have done in the intervening 19
years.
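
The relationship can at least be checked numerically.  For a contingency table with N counts in total, the log-likelihood ratio statistic G = 2 * sum(O * ln(O / E)) is exactly 2 * N * MI when the mutual information MI is measured in nats, and G is the quantity that is asymptotically chi-squared distributed, with (rows - 1) * (columns - 1) degrees of freedom, when rows and columns are independent.  A minimal sketch; the table of counts and the function names are made up for illustration.

```python
import math

def table_totals(table):
    """Return (grand total, row totals, column totals) of a count table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return sum(row_tot), row_tot, col_tot

def mutual_information_nats(table):
    """Mutual information between the row and column variables, in nats."""
    n, row_tot, col_tot = table_totals(table)
    mi = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing
                mi += (observed / n) * math.log(observed * n / (row_tot[i] * col_tot[j]))
    return mi

def g_statistic(table):
    """Log-likelihood ratio statistic G = 2 * sum(O * ln(O / E))."""
    n, row_tot, col_tot = table_totals(table)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_tot[i] * col_tot[j] / n
                g += 2.0 * observed * math.log(observed / expected)
    return g

# Hypothetical 2x2 table of co-occurrence counts.
counts = [[10, 20], [30, 40]]
n_total = sum(sum(row) for row in counts)
mi = mutual_information_nats(counts)
g = g_statistic(counts)
dof = (len(counts) - 1) * (len(counts[0]) - 1)
print(mi, g, 2 * n_total * mi, dof)
```

The two printed statistics agree up to floating-point rounding, so a chi-squared table with the printed degrees of freedom can be used to judge whether an observed mutual information is larger than chance would explain.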



-- 
Ted Dunning, CTO
DeepDyve

Re: Validating clustering output

Posted by Benson Margulies <bi...@gmail.com>.
Brown and Della Pietra's algorithm for clustering based on entropy is
somewhat infamous for the difficulty of achieving usable performance.
Mike Collins was responsible for a famously speedy version. Having
built one that is just barely fast enough in C++, I wouldn't recommend
trying it in Java. Of course, you aren't proposing that, just
recommending the bigram entropy metric or something like it.
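
For the perplexity idea in the message quoted below, a rough sketch of the computation might look like this. It assumes a per-cluster unigram model with add-one smoothing, which is a simplification picked for illustration (the thread does not specify a model), and the documents, labels and function names are all hypothetical. Held-out words missing from the vocabulary would need extra handling that is left out here.

```python
import math
from collections import Counter

def train_cluster_models(docs, labels, vocab, alpha=1.0):
    """Smoothed unigram distribution P(word | cluster) for every cluster label."""
    counts = {}
    for words, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(words)
    models = {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values()) + alpha * len(vocab)
        models[label] = {w: (word_counts[w] + alpha) / total for w in vocab}
    return models

def perplexity(docs, labels, models):
    """2 ** (average negative log2 probability per held-out word)."""
    log_sum = 0.0
    n_words = 0
    for words, label in zip(docs, labels):
        for w in words:
            log_sum += math.log2(models[label][w])
            n_words += 1
    return 2.0 ** (-log_sum / n_words)

vocab = ["a", "b", "c", "d"]
train_docs = [["a", "a", "b"], ["c", "d", "d"]]
train_labels = [0, 1]
held_docs = [["a", "b"], ["d", "c"]]
held_labels = [0, 1]

# Perplexity when cluster membership is used to predict the words.
clustered = perplexity(held_docs, held_labels,
                       train_cluster_models(train_docs, train_labels, vocab))
# Baseline: everything lumped into a single cluster.
baseline = perplexity(held_docs, [0, 0],
                      train_cluster_models(train_docs, [0, 0], vocab))
print(clustered, baseline)
```

A clustering that is doing useful work should give a held-out perplexity clearly below the single-cluster baseline, as it does on this toy data.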

On Mon, Jul 27, 2009 at 9:42 PM, Ted Dunning<te...@gmail.com> wrote:
> (vastly delayed response ... huge distractions competing with any answer
> that takes more than 2 minutes are to blame)
>
> Grant,
>
> For evaluating clustering for symbol sequences:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.7275
>
> Most of the other references I have found talk about quality relative to
> gold standard judgments about whether exemplars are in the same class or
> relative to similarity/distinctiveness ratios.  Neither is all that
> satisfactory.
>
> My preference is an entropic measure that describes how much of the
> information in your data is captured by the clustering vs how much residual
> info there is.
>
> The other reference I am looking for may be in David MacKay's book.  The
> idea is that you measure the quality of the approximation by looking at the
> entropy in the cluster assignment relative to the residual required to
> precisely specify the original data relative to the quantized value.
>
> This is also related to trading off signal/noise in a vector quantizer.
>
> David,  do you have a moment to talk about this with me?  I can't free up
> the time to chase these final references and come up with a nice formula for
> this.  I think you could do it in 10-20 minutes.
>
> On Tue, Jul 14, 2009 at 6:41 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
>> On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:
>>
>>> A principled approach to cluster evaluation is to measure how well the
>>> cluster membership captures the structure of unseen data.  A natural
>>> measure for this is to measure how much of the entropy of the data is
>>> captured by cluster membership.  For k-means and its natural L_2 metric,
>>> the natural cluster quality metric is the squared distance from the
>>> nearest centroid adjusted by the log_2 of the number of clusters.  This
>>> can be compared to the squared magnitude of the original data or the
>>> squared deviation from the centroid for all of the data.  The idea is
>>> that you are changing the representation of the data by allocating some
>>> of the bits in your original representation to represent which cluster
>>> each point is in.  If those bits aren't made up by the residue being
>>> small then your clustering is making a bad trade-off.
>>>
>>> In the past, I have used other more heuristic measures as well.  One of
>>> the key characteristics that I would like to see out of a clustering is a
>>> degree of stability.  Thus, I look at the fractions of points that are
>>> assigned to each cluster or the distribution of distances from the
>>> cluster centroid.  These values should be relatively stable when applied
>>> to held-out data.
>>>
>>> For text, you can actually compute perplexity which measures how well
>>> cluster membership predicts what words are used.  This is nice because you
>>> don't have to worry about the entropy of real valued numbers.
>>>
>>
>> Do you have any references on any of the above approaches?
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> http://www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)
>

Re: Validating clustering output

Posted by Ted Dunning <te...@gmail.com>.
These all depend on gold standards.  If you have those, then it is easy to
evaluate a clustering.
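
One such measure is the Rand index, which is covered on the IR-book page linked below and is the subject of the Rand paper Grant asks about: it is the fraction of point pairs on which the clustering and the gold standard agree about "same group" versus "different group".  A small sketch with made-up labels; the function name is hypothetical.

```python
from itertools import combinations

def rand_index(predicted, gold):
    """Fraction of point pairs on which two labelings agree (same vs. different group)."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(gold)), 2):
        same_predicted = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_predicted == same_gold:
            agree += 1
        total += 1
    return agree / total

# Hypothetical cluster assignments and gold-standard classes for five points.
predicted = [0, 0, 1, 1, 1]
gold = ["x", "x", "x", "y", "y"]
ri = rand_index(predicted, gold)
print(ri)
```

The loop visits every pair, so this is quadratic in the number of points and only meant for small evaluation sets.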

What is hard is to evaluate a clustering without a standard.  I have done
this, somewhat, in the past by looking at stability over time in terms of
cluster size and membership.  I have also looked at the utility of cluster
membership in predicting objective attributes not used in the clustering.
The stability criteria might apply to some of our data sets.  The utility
measure only works in a modeling setting.
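
A minimal sketch of the cluster-size stability check, assuming centroids that were already fitted elsewhere and squared Euclidean distance.  Summarizing the shift between the two sets of fractions with total variation distance is an assumption made for illustration, not something prescribed in this thread, and all names and points are hypothetical.

```python
from collections import Counter

def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
    )

def cluster_fractions(points, centroids):
    """Fraction of the points assigned to each cluster."""
    counts = Counter(nearest_centroid(p, centroids) for p in points)
    return [counts[c] / len(points) for c in range(len(centroids))]

def total_variation(f, g):
    """Half the L1 distance between two discrete distributions (0 means identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(f, g))

# Hypothetical centroids, training points and held-out points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
train = [(0.0, 1.0), (1.0, 0.0), (9.0, 10.0), (10.0, 9.0)]
held_out = [(1.0, 1.0), (9.0, 9.0), (10.0, 11.0), (11.0, 10.0)]

train_fractions = cluster_fractions(train, centroids)
held_fractions = cluster_fractions(held_out, centroids)
shift = total_variation(train_fractions, held_fractions)
print(train_fractions, held_fractions, shift)
```

A shift near zero suggests the cluster sizes carry over to new data; what counts as "too large" depends on the sample size and is left open here.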

On Tue, Aug 18, 2009 at 7:32 AM, Grant Ingersoll <gs...@apache.org> wrote:

> Also found:
> http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
>
>
> On Aug 18, 2009, at 9:55 AM, Grant Ingersoll wrote:
>
>
>> On Jul 27, 2009, at 9:42 PM, Ted Dunning wrote:
>>
>>> The other reference I am looking for may be in David MacKay's book.  The
>>> idea is that you measure the quality of the approximation by looking at
>>> the
>>> entropy in the cluster assignment relative to the residual required to
>>> precisely specify the original data relative to the quantized value.
>>>
>>
>> Is the WM Rand paper in JSTOR ("Objective Criteria for the Evaluation of
>> Clustering Methods") worthwhile on this topic?  Basic searches for
>> "evaluating clustering" or "cluster evaluation" on Google Scholar turn up
>> very little.  The Rand paper is from 1971, but who knows...
>>
>> Of course, I'd like something that doesn't require purchase (sigh.)
>>
>
>
>


-- 
Ted Dunning, CTO
DeepDyve

Re: AW: Validating clustering output

Posted by Grant Ingersoll <gs...@apache.org>.
Hi Benjamin,

Please start a separate thread with an appropriate subject, as you  
will be much more likely to get answers for your question.

-Grant

On Aug 18, 2009, at 11:37 AM, Benjamin Dageroth wrote:

> I just installed Mahout on my Windows machine and wanted to try out
> the Taste example with the GroupLens data. I seem to have done
> everything according to the instructions at
> http://lucene.apache.org/mahout/taste.html#demo
> but I cannot get the webapp running and get a 503 message: Service
> Unavailable. When starting Jetty, the servlet container accompanying
> the demo, it goes through and boasts that it started Jetty Server,
> but during startup it lets me know that there is an exception, which
> I suppose is the culprit.
>
> java.net.URISyntaxException: Illegal character in path at index 18:
> file:/C:/Dokumente und Einstellungen/bda/.m2/repository/org/mortbay/jetty/jetty-maven-plugin/7.0.0.1beta3/jetty-maven-plugin-7.0.0.1beta3.jar
> at java.net.URI$Parser.fail(URI.java:2809)
> at java.net.URI$Parser.checkChars(URI.java:2982)
> at java.net.URI$Parser.parseHierarchical(URI.java:3066)
> at java.net.URI$Parser.parse(URI.java:3014)
> at java.net.URI.<init>(URI.java:578)
> at java.net.URL.toURI(URL.java:918)
> ...
> Etc.
> The complete log of the startup process can be found further down. I
> would guess that the spaces pose a problem, but I am not sure what I
> can do about that, since a user's home directory is always located
> under c:\dokumente und Einstellungen\ and Maven goes to look there.
>
> Any idea where I can change the path, in case this is indeed the
> problem? Otherwise, what is my problem? ;-)
>
> Thanks a lot,
>
> Benjamin
>
> ------------------------------------------------------------------
> Complete Log:
> $ /cygdrive/c/workspace/maven/apache-maven-2.2.0/bin/mvn jetty:run-war
> [INFO] Scanning for projects...
> [INFO]  
> ------------------------------------------------------------------------
> [INFO] Building Mahout Taste Webapp
> [INFO]    task-segment: [jetty:run-war]
> [INFO]  
> ------------------------------------------------------------------------
> [INFO] Preparing jetty:run-war
> [INFO] [resources:resources {execution: default-resources}]
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] Copying 4 resources
> [INFO] Copying 1 resource to c:\workspace\Mahout for Zanox\taste-web 
> \target/maho
> ut-taste-webapp-0.2-SNAPSHOT/WEB-INF/lib
> [INFO] [resources:copy-resources {execution: copy-resources}]
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] Copying 3 resources
> [INFO] [compiler:compile {execution: default-compile}]
> [INFO] Nothing to compile - all classes are up to date
> [INFO] [resources:testResources {execution: default-testResources}]
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory c:\workspace\Mahout for  
> Zanox\taste-w
> eb\src\test\resources
> [INFO] [compiler:testCompile {execution: default-testCompile}]
> [INFO] Nothing to compile - all classes are up to date
> [INFO] [surefire:test {execution: default-test}]
> [INFO] No tests to run.
> [INFO] [war:war {execution: default-war}]
> [INFO] Packaging webapp
> [INFO] Assembling webapp[mahout-taste-webapp] in [c:\workspace 
> \Mahout for Zanox\
> taste-web\target\mahout-taste-webapp-0.2-SNAPSHOT]
> [INFO] Dependency[Dependency {groupId=org.apache.mahout,  
> artifactId=mahout-core,
> version=0.2-SNAPSHOT, type=jar}] has changed (was Dependency  
> {groupId=org.apach
> e.mahout, artifactId=mahout-core, version=0.2-SNAPSHOT, type=jar}).
> [INFO] Dependency[Dependency {groupId=axis, artifactId=axis,  
> version=1.4, type=j
> ar}] has changed (was Dependency {groupId=axis, artifactId=axis,  
> version=1.4, ty
> pe=jar}).
> [INFO] Dependency[Dependency {groupId=javax.servlet,  
> artifactId=servlet-api, ver
> sion=2.4, type=jar}] has changed (was Dependency  
> {groupId=javax.servlet, artifac
> tId=servlet-api, version=2.4, type=jar}).
> [INFO] Dependency[Dependency {groupId=org.slf4j, artifactId=slf4j- 
> api, version=1
> .5.6, type=jar}] has changed (was Dependency {groupId=org.slf4j,  
> artifactId=slf4
> j-api, version=1.5.6, type=jar}).
> [INFO] Dependency[Dependency {groupId=org.slf4j, artifactId=slf4j- 
> jcl, version=1
> .5.6, type=jar}] has changed (was Dependency {groupId=org.slf4j,  
> artifactId=slf4
> j-jcl, version=1.5.6, type=jar}).
> [INFO] Processing war project
> [INFO] Copying webapp resources[c:\workspace\Mahout for Zanox\taste- 
> web\src\main
> \webapp]
> [INFO] Webapp assembled in[94 msecs]
> [INFO] Building war: c:\workspace\Mahout for Zanox\taste-web\target 
> \mahout-taste
> -webapp-0.2-SNAPSHOT.war
> [INFO] [jetty:run-war {execution: default-cli}]
> [INFO] Configuring Jetty for project: Mahout Taste Webapp
> 2009-08-18 17:29:38.216::INFO:  Logging to STDERR via  
> org.eclipse.jetty.util.log
> .StdErrLog
> [INFO] Context path = /
> [INFO] Tmp directory = C:\workspace\Mahout for Zanox\taste-web\target 
> \work
> [INFO] Web defaults = org/eclipse/jetty/webapp/webdefault.xml
> [INFO] Web overrides =  none
> [INFO] Starting jetty 7.0.0.M4 ...
> 2009-08-18 17:29:38.247::INFO:  jetty-7.0.0.M4
> 2009-08-18 17:29:38.278::INFO:  Extract C:\workspace\Mahout for Zanox 
> \taste-web\
> target\mahout-taste-webapp-0.2-SNAPSHOT.war to C:\workspace\Mahout  
> for Zanox\tas
> te-web\target\work\webapp
> 2009-08-18 17:29:41.106::WARN:  Failed startup of context  
> JettyWebAppContext@4eb
> 585@4eb585/,file:/C:/workspace/Mahout%20for%20Zanox/taste-web/target/ 
> work/webapp
> /,C:\workspace\Mahout for Zanox\taste-web\target\mahout-taste- 
> webapp-0.2-SNAPSHO
> T.war
> java.net.URISyntaxException: Illegal character in path at index 18:  
> file:/C:/Dok
> umente und Einstellungen/bda/.m2/repository/org/mortbay/jetty/jetty- 
> maven-plugin
> /7.0.0.1beta3/jetty-maven-plugin-7.0.0.1beta3.jar
>        at java.net.URI$Parser.fail(URI.java:2809)
>        at java.net.URI$Parser.checkChars(URI.java:2982)
>        at java.net.URI$Parser.parseHierarchical(URI.java:3066)
>        at java.net.URI$Parser.parse(URI.java:3014)
>        at java.net.URI.<init>(URI.java:578)
>        at java.net.URL.toURI(URL.java:918)
>        at  
> org.eclipse.jetty.webapp.WebInfConfiguration.preConfigure(WebInfConfi
> guration.java:79)
>        at  
> org.mortbay.jetty.plugin.MavenWebInfConfiguration.preConfigure(MavenW
> ebInfConfiguration.java:39)
>        at  
> org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:343
> )
>        at  
> org.mortbay.jetty.plugin.JettyWebAppContext.doStart(JettyWebAppContex
> t.java:89)
>        at  
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLife
> Cycle.java:56)
>        at  
> org.eclipse.jetty.server.handler.HandlerCollection.doStart(HandlerCol
> lection.java:164)
>        at  
> org.eclipse.jetty.server.handler.ContextHandlerCollection.doStart(Con
> textHandlerCollection.java:161)
>        at  
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLife
> Cycle.java:56)
>        at  
> org.eclipse.jetty.server.handler.HandlerCollection.doStart(HandlerCol
> lection.java:164)
>        at  
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLife
> Cycle.java:56)
>        at  
> org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrappe
> r.java:92)
>        at org.eclipse.jetty.server.Server.doStart(Server.java:225)
>        at  
> org.mortbay.jetty.plugin.JettyServer.doStart(JettyServer.java:69)
>        at  
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLife
> Cycle.java:56)
>        at  
> org.mortbay.jetty.plugin.AbstractJettyMojo.startJetty(AbstractJettyMo
> jo.java:423)
>        at  
> org.mortbay.jetty.plugin.AbstractJettyMojo.execute(AbstractJettyMojo.
> java:366)
>        at  
> org.mortbay.jetty.plugin.JettyRunWarMojo.execute(JettyRunWarMojo.java
> :68)
>        at  
> org.apache.maven.plugin.DefaultPluginManager.executeMojo(DefaultPlugi
> nManager.java:483)
>        at  
> org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(Defa
> ultLifecycleExecutor.java:678)
>        at  
> org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeStandalone
> Goal(DefaultLifecycleExecutor.java:553)
>        at  
> org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoal(Defau
> ltLifecycleExecutor.java:523)
>        at  
> org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalAndHan
> dleFailures(DefaultLifecycleExecutor.java:371)
>        at  
> org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeTaskSegmen
> ts(DefaultLifecycleExecutor.java:332)
>        at  
> org.apache.maven.lifecycle.DefaultLifecycleExecutor.execute(DefaultLi
> fecycleExecutor.java:181)
>        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java: 
> 356)
>        at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:137)
>        at org.apache.maven.cli.MavenCli.main(MavenCli.java:362)
>        at  
> org.apache.maven.cli.compat.CompatibleMain.main(CompatibleMain.java:4
> 1)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at  
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
> java:39)
>        at  
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
> sorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at  
> org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
>        at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
>        at  
> org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)
>
>        at org.codehaus.classworlds.Launcher.main(Launcher.java:375)
> 2009-08-18 17:29:41.184::INFO:  Started  
> SelectChannelConnector@0.0.0.0:8080
> [INFO] Started Jetty Server
>
> _______________________________________
> Benjamin Dageroth, Key Account Manager / Softwareentwickler
> Webtrekk GmbH
> Boxhagener Str. 76-78, 10245 Berlin
> fon 030 - 755 415 - 360
> fax 030 - 755 415 - 100
> benjamin.dageroth@webtrekk.com
> http://www.webtrekk.com
> Amtsgericht Berlin, HRB 93435 B
> Geschäftsführer Christian Sauer
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/


RE: AW: Validating clustering output

Posted by Jack Tanner <ih...@hotmail.com>.
As Grant said, please start new threads for new questions.

Aside from that, this is apparently a known issue in maven/jetty.

https://issues.sonatype.org/browse/MVNDEF-114
http://jira.codehaus.org/browse/JETTY-1063

One workaround is to define a localRepository path that has no spaces.
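
To see why a space-free path helps, the small sketch below locates the offending character in the URI from the exception message and shows what a percent-encoded file URI would look like. In Maven, the local repository location is set with the localRepository element of settings.xml; the space-free directory used here is only a hypothetical example, not a path from this thread.

```python
from urllib.parse import quote

# Start of the URI exactly as it appears in the exception message.
raw_uri = "file:/C:/Dokumente und Einstellungen/bda/.m2/repository"

# Position of the first space; a strict URI parser rejects the string here.
first_space = raw_uri.index(" ")

# A legal file URI percent-encodes each space as %20.
encoded_uri = "file:/" + quote("C:/Dokumente und Einstellungen/bda/.m2/repository", safe="/:")

# Hypothetical space-free repository location (an assumption for illustration).
safe_repo = "C:/m2/repository"
needs_escaping = " " in safe_repo

print(first_space)
print(encoded_uri)
print(needs_escaping)
```

The first printed value matches the "index 18" in the exception, which points at the space after "Dokumente".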

----------------------------------------
> From: Benjamin.Dageroth@webtrekk.com
> To: mahout-user@lucene.apache.org
> Date: Tue, 18 Aug 2009 17:37:46 +0200
> Subject: AW: Validating clustering output
>
> I just installed Mahout on my Windows machine and wanted to try out the Taste example with the GroupLens data. I seem to have done everything according to the instructions at http://lucene.apache.org/mahout/taste.html#demo but I cannot get the webapp running and get a 503 message: Service Unavailable. When starting Jetty, the servlet container accompanying the demo, it goes through and boasts that it started Jetty Server, but during startup it lets me know that there is an exception, which I suppose is the culprit.
>
> java.net.URISyntaxException: Illegal character in path at index 18: file:/C:/Dokumente und Einstellungen/bda/.m2/repository/org/mortbay/jetty/jetty-maven-plugin/7.0.0.1beta3/jetty-maven-plugin-7.0.0.1beta3.jar

Re: Validating clustering output

Posted by Benjamin Dageroth <Be...@webtrekk.com>.
I just installed Mahout on my Windows machine and wanted to try out the Taste example with the GroupLens data. I believe I followed the instructions at http://lucene.apache.org/mahout/taste.html#demo, but I cannot get the webapp running: every request returns a 503 (Service Unavailable). When I start Jetty, the servlet container that accompanies the demo, startup runs through and reports "Started Jetty Server", but along the way it logs an exception, which I suppose is the culprit.

java.net.URISyntaxException: Illegal character in path at index 18: file:/C:/Dokumente und Einstellungen/bda/.m2/repository/org/mortbay/jetty/jetty-maven-plugin/7.0.0.1beta3/jetty-maven-plugin-7.0.0.1beta3.jar
at java.net.URI$Parser.fail(URI.java:2809)
at java.net.URI$Parser.checkChars(URI.java:2982)
at java.net.URI$Parser.parseHierarchical(URI.java:3066)
at java.net.URI$Parser.parse(URI.java:3014)
at java.net.URI.<init>(URI.java:578)
at java.net.URL.toURI(URL.java:918)
...
Etc.
The complete log of the startup process can be found further down. I suspect that the spaces in the path are the problem, but I am not sure what I can do about it: Maven looks in the user's home directory, which on this machine is always located under C:\Dokumente und Einstellungen\.

Any idea where I can change the path, in case this is indeed the problem? Otherwise, what is my problem? ;-)
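
The position reported in the exception can be checked directly: index 18 of the URL string is the first space, right after "Dokumente". The following is a minimal sketch in Python, used here only for illustration since the failing code is Java. It locates the offending character and shows the percent-encoded form, analogous to the `Mahout%20for%20Zanox` URL that appears correctly encoded elsewhere in the log.

```python
from urllib.parse import quote

# URL string copied from the exception message in the log.
raw = ("file:/C:/Dokumente und Einstellungen/bda/.m2/repository/org/mortbay/"
       "jetty/jetty-maven-plugin/7.0.0.1beta3/jetty-maven-plugin-7.0.0.1beta3.jar")

# Position of the first space; this is the index the URI parser complains about.
first_space = raw.index(" ")

# Percent-encode everything except the characters that structure the URL.
encoded = quote(raw, safe=":/")

print(first_space)
print(encoded)
```

This only confirms that the unencoded space is what the parser rejects; it does not by itself show where the path should be changed.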

Thanks a lot,

Benjamin

------------------------------------------------------------------
Complete Log:
$ /cygdrive/c/workspace/maven/apache-maven-2.2.0/bin/mvn jetty:run-war
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Building Mahout Taste Webapp
[INFO]    task-segment: [jetty:run-war]
[INFO] ------------------------------------------------------------------------
[INFO] Preparing jetty:run-war
[INFO] [resources:resources {execution: default-resources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 4 resources
[INFO] Copying 1 resource to c:\workspace\Mahout for Zanox\taste-web\target/maho
ut-taste-webapp-0.2-SNAPSHOT/WEB-INF/lib
[INFO] [resources:copy-resources {execution: copy-resources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 3 resources
[INFO] [compiler:compile {execution: default-compile}]
[INFO] Nothing to compile - all classes are up to date
[INFO] [resources:testResources {execution: default-testResources}]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory c:\workspace\Mahout for Zanox\taste-w
eb\src\test\resources
[INFO] [compiler:testCompile {execution: default-testCompile}]
[INFO] Nothing to compile - all classes are up to date
[INFO] [surefire:test {execution: default-test}]
[INFO] No tests to run.
[INFO] [war:war {execution: default-war}]
[INFO] Packaging webapp
[INFO] Assembling webapp[mahout-taste-webapp] in [c:\workspace\Mahout for Zanox\
taste-web\target\mahout-taste-webapp-0.2-SNAPSHOT]
[INFO] Dependency[Dependency {groupId=org.apache.mahout, artifactId=mahout-core,
 version=0.2-SNAPSHOT, type=jar}] has changed (was Dependency {groupId=org.apach
e.mahout, artifactId=mahout-core, version=0.2-SNAPSHOT, type=jar}).
[INFO] Dependency[Dependency {groupId=axis, artifactId=axis, version=1.4, type=j
ar}] has changed (was Dependency {groupId=axis, artifactId=axis, version=1.4, ty
pe=jar}).
[INFO] Dependency[Dependency {groupId=javax.servlet, artifactId=servlet-api, ver
sion=2.4, type=jar}] has changed (was Dependency {groupId=javax.servlet, artifac
tId=servlet-api, version=2.4, type=jar}).
[INFO] Dependency[Dependency {groupId=org.slf4j, artifactId=slf4j-api, version=1
.5.6, type=jar}] has changed (was Dependency {groupId=org.slf4j, artifactId=slf4
j-api, version=1.5.6, type=jar}).
[INFO] Dependency[Dependency {groupId=org.slf4j, artifactId=slf4j-jcl, version=1
.5.6, type=jar}] has changed (was Dependency {groupId=org.slf4j, artifactId=slf4
j-jcl, version=1.5.6, type=jar}).
[INFO] Processing war project
[INFO] Copying webapp resources[c:\workspace\Mahout for Zanox\taste-web\src\main
\webapp]
[INFO] Webapp assembled in[94 msecs]
[INFO] Building war: c:\workspace\Mahout for Zanox\taste-web\target\mahout-taste
-webapp-0.2-SNAPSHOT.war
[INFO] [jetty:run-war {execution: default-cli}]
[INFO] Configuring Jetty for project: Mahout Taste Webapp
2009-08-18 17:29:38.216::INFO:  Logging to STDERR via org.eclipse.jetty.util.log
.StdErrLog
[INFO] Context path = /
[INFO] Tmp directory = C:\workspace\Mahout for Zanox\taste-web\target\work
[INFO] Web defaults = org/eclipse/jetty/webapp/webdefault.xml
[INFO] Web overrides =  none
[INFO] Starting jetty 7.0.0.M4 ...
2009-08-18 17:29:38.247::INFO:  jetty-7.0.0.M4
2009-08-18 17:29:38.278::INFO:  Extract C:\workspace\Mahout for Zanox\taste-web\
target\mahout-taste-webapp-0.2-SNAPSHOT.war to C:\workspace\Mahout for Zanox\tas
te-web\target\work\webapp
2009-08-18 17:29:41.106::WARN:  Failed startup of context JettyWebAppContext@4eb
585@4eb585/,file:/C:/workspace/Mahout%20for%20Zanox/taste-web/target/work/webapp
/,C:\workspace\Mahout for Zanox\taste-web\target\mahout-taste-webapp-0.2-SNAPSHO
T.war
java.net.URISyntaxException: Illegal character in path at index 18: file:/C:/Dok
umente und Einstellungen/bda/.m2/repository/org/mortbay/jetty/jetty-maven-plugin
/7.0.0.1beta3/jetty-maven-plugin-7.0.0.1beta3.jar
        at java.net.URI$Parser.fail(URI.java:2809)
        at java.net.URI$Parser.checkChars(URI.java:2982)
        at java.net.URI$Parser.parseHierarchical(URI.java:3066)
        at java.net.URI$Parser.parse(URI.java:3014)
        at java.net.URI.<init>(URI.java:578)
        at java.net.URL.toURI(URL.java:918)
        at org.eclipse.jetty.webapp.WebInfConfiguration.preConfigure(WebInfConfi
guration.java:79)
        at org.mortbay.jetty.plugin.MavenWebInfConfiguration.preConfigure(MavenW
ebInfConfiguration.java:39)
        at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:343
)
        at org.mortbay.jetty.plugin.JettyWebAppContext.doStart(JettyWebAppContex
t.java:89)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLife
Cycle.java:56)
        at org.eclipse.jetty.server.handler.HandlerCollection.doStart(HandlerCol
lection.java:164)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.doStart(Con
textHandlerCollection.java:161)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLife
Cycle.java:56)
        at org.eclipse.jetty.server.handler.HandlerCollection.doStart(HandlerCol
lection.java:164)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLife
Cycle.java:56)
        at org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrappe
r.java:92)
        at org.eclipse.jetty.server.Server.doStart(Server.java:225)
        at org.mortbay.jetty.plugin.JettyServer.doStart(JettyServer.java:69)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLife
Cycle.java:56)
        at org.mortbay.jetty.plugin.AbstractJettyMojo.startJetty(AbstractJettyMo
jo.java:423)
        at org.mortbay.jetty.plugin.AbstractJettyMojo.execute(AbstractJettyMojo.
java:366)
        at org.mortbay.jetty.plugin.JettyRunWarMojo.execute(JettyRunWarMojo.java
:68)
        at org.apache.maven.plugin.DefaultPluginManager.executeMojo(DefaultPlugi
nManager.java:483)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(Defa
ultLifecycleExecutor.java:678)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeStandalone
Goal(DefaultLifecycleExecutor.java:553)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoal(Defau
ltLifecycleExecutor.java:523)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalAndHan
dleFailures(DefaultLifecycleExecutor.java:371)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeTaskSegmen
ts(DefaultLifecycleExecutor.java:332)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.execute(DefaultLi
fecycleExecutor.java:181)
        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:356)
        at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:137)
        at org.apache.maven.cli.MavenCli.main(MavenCli.java:362)
        at org.apache.maven.cli.compat.CompatibleMain.main(CompatibleMain.java:4
1)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
sorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
        at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
        at org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)

        at org.codehaus.classworlds.Launcher.main(Launcher.java:375)
2009-08-18 17:29:41.184::INFO:  Started SelectChannelConnector@0.0.0.0:8080
[INFO] Started Jetty Server

_______________________________________
Benjamin Dageroth, Key Account Manager / Softwareentwickler
Webtrekk GmbH
Boxhagener Str. 76-78, 10245 Berlin
fon 030 - 755 415 - 360
fax 030 - 755 415 - 100
benjamin.dageroth@webtrekk.com
http://www.webtrekk.com
Amtsgericht Berlin, HRB 93435 B
Geschäftsführer Christian Sauer


Re: Validating clustering output

Posted by Grant Ingersoll <gs...@apache.org>.
Also found: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

On Aug 18, 2009, at 9:55 AM, Grant Ingersoll wrote:

>
> On Jul 27, 2009, at 9:42 PM, Ted Dunning wrote:
>
>> The other reference I am looking for may be in David MacKay's book.  The
>> idea is that you measure the quality of the approximation by looking at the
>> entropy in the cluster assignment relative to the residual required to
>> precisely specify the original data relative to the quantized value.
>
> Is the W. M. Rand paper in JSTOR ("Objective Criteria for the Evaluation of
> Clustering Methods") worthwhile on this topic?  Basic searches for
> "evaluating clustering" or "cluster evaluation" on Google Scholar
> turn up very little.  The Rand paper is from 1971, but who knows...
>
> Of course, I'd like something that doesn't require purchase (sigh.)



Re: Validating clustering output

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 27, 2009, at 9:42 PM, Ted Dunning wrote:

> The other reference I am looking for may be in David MacKay's book.  The
> idea is that you measure the quality of the approximation by looking at the
> entropy in the cluster assignment relative to the residual required to
> precisely specify the original data relative to the quantized value.

Is the W. M. Rand paper in JSTOR ("Objective Criteria for the Evaluation of
Clustering Methods") worthwhile on this topic?  Basic searches for
"evaluating clustering" or "cluster evaluation" on Google Scholar turn
up very little.  The Rand paper is from 1971, but who knows...

Of course, I'd like something that doesn't require purchase (sigh.)
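
For context, the measure introduced in that paper is usually called the Rand index: the fraction of point pairs on which two clusterings agree, either by putting the pair in the same cluster or by keeping it apart. A minimal sketch, assuming the two clusterings are given as equal-length lists of labels (the function name is made up for illustration):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]   # pair together in clustering A?
        same_b = labels_b[i] == labels_b[j]   # pair together in clustering B?
        agree += same_a == same_b
        total += 1
    if total == 0:
        raise ValueError("need at least two points")
    return agree / total

# Same partition with different label names: full agreement.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# Three of the six pairs are treated the same way by both clusterings.
print(rand_index([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.5
```

Note that this compares a clustering against another labeling, so it needs some reference assignment to compare with.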

Re: Validating clustering output

Posted by Ted Dunning <te...@gmail.com>.
(Vastly delayed response; huge distractions competing with any answer that
takes more than 2 minutes are to blame.)

Grant,

For evaluating clustering for symbol sequences:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.7275

Most of the other references I have found talk about quality relative to
gold standard judgments about whether exemplars are in the same class or
relative to similarity/distinctiveness ratios.  Neither is all that
satisfactory.

My preference is an entropic measure that describes how much of the
information in your data is captured by the clustering vs how much residual
info there is.

The other reference I am looking for may be in David MacKay's book.  The
idea is that you measure the quality of the approximation by looking at the
entropy in the cluster assignment relative to the residual required to
precisely specify the original data relative to the quantized value.
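
One way to make this trade-off concrete for k-means is sketched below. It is only an illustration under assumptions that are not stated in this thread: residuals are coded with an isotropic Gaussian of fixed variance `sigma2`, every cluster label costs log2(k) bits, and the cost of storing the centroids themselves is ignored. Under those assumptions, a clustering pays off when the bits saved on the residual exceed the bits spent on labels.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def bits_saved(points, centroids, sigma2=1.0):
    """Bits saved by coding each point as (cluster label, residual)
    instead of as a residual from the global mean.

    Assumption: a squared error e costs e / (2 * sigma2 * ln 2) bits under
    the Gaussian code; the additive constant is the same for both codes
    and cancels out.
    """
    scale = 1.0 / (2.0 * sigma2 * math.log(2))
    global_mean = mean(points)
    baseline = sum(sq_dist(p, global_mean) for p in points) * scale
    residual = sum(min(sq_dist(p, c) for c in centroids) for p in points) * scale
    label_bits = len(points) * math.log2(len(centroids))
    return baseline - (residual + label_bits)

# Two well-separated groups: the labels are worth their cost.
separated = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(bits_saved(separated, [(0, 0.5), (10, 0.5)]) > 0)  # True

# One tight blob split in two: the labels cost more than they save.
blob = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(bits_saved(blob, [(0, 0.5), (1, 0.5)]) < 0)  # True
```

In line with the approach described earlier in the thread, `points` would normally be held-out data rather than the data used to fit the centroids.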

This is also related to trading off signal/noise in a vector quantizer.

David,  do you have a moment to talk about this with me?  I can't free up
the time to chase these final references and come up with a nice formula for
this.  I think you could do it in 10-20 minutes.

On Tue, Jul 14, 2009 at 6:41 AM, Grant Ingersoll <gs...@apache.org> wrote:

> On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:
>
>> A principled approach to cluster evaluation is to measure how well the
>> cluster membership captures the structure of unseen data.  A natural
>> measure
>> for this is to measure how much of the entropy of the data is captured by
>> cluster membership.  For k-means and its natural L_2 metric, the natural
>> cluster quality metric is the squared distance from the nearest centroid
>> adjusted by the log_2 of the number of clusters.  This can be compared to
>> the squared magnitude of the original data or the squared deviation from
>> the
>> centroid for all of the data.  The idea is that you are changing the
>> representation of the data by allocating some of the bits in your original
>> representation to represent which cluster each point is in.  If those bits
>> aren't made up by the residue being small then your clustering is making a
>> bad trade-off.
>>
>> In the past, I have used other more heuristic measures as well.  One of
>> the
>> key characteristics that I would like to see out of a clustering is a
>> degree
>> of stability.  Thus, I look at the fractions of points that are assigned
>> to
>> each cluster or the distribution of distances from the cluster centroid.
>> These values should be relatively stable when applied to held-out data.
>>
>> For text, you can actually compute perplexity which measures how well
>> cluster membership predicts what words are used.  This is nice because you
>> don't have to worry about the entropy of real valued numbers.
>>
>
> Do you have any references on any of the above approaches?
>
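
The stability check described in the quoted message can be sketched as follows: assign training and held-out points to their nearest centroid and compare the fraction of points that lands in each cluster. The total-variation distance used to compare the two sets of fractions is an assumption for illustration; the thread does not name a specific comparison, and the function names are hypothetical.

```python
from collections import Counter

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])))

def cluster_fractions(points, centroids):
    """Fraction of points assigned to each centroid."""
    counts = Counter(nearest(p, centroids) for p in points)
    return [counts[i] / len(points) for i in range(len(centroids))]

def fraction_shift(train, held_out, centroids):
    """Total-variation distance between the per-cluster fractions.

    0.0 means identical fractions; 1.0 means no overlap at all.
    """
    f_train = cluster_fractions(train, centroids)
    f_held = cluster_fractions(held_out, centroids)
    return 0.5 * sum(abs(a - b) for a, b in zip(f_train, f_held))

centroids = [(0.0,), (10.0,)]
train = [(0.0,), (1.0,), (9.0,), (10.0,)]        # fractions 0.5 / 0.5
held_out = [(0.5,), (9.5,), (10.5,), (11.0,)]    # fractions 0.25 / 0.75
print(fraction_shift(train, held_out, centroids))  # 0.25
```

A small shift suggests the cluster sizes are stable on unseen data; what counts as "small" depends on the data set and sample size.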



-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)