Posted to user@mahout.apache.org by Toby Doig <to...@gmail.com> on 2010/04/06 06:42:15 UTC

clustering your data with dirichlet issue

I've run the Dirichlet command line job and now have an output folder with
some state-0, state-1, ... state-5 folders, each containing part-00000 and
.part-00000.crc files. However, the ClusteringYourData wiki page's
Retrieving the Output section just says TODO. I don't know how to turn those
part files into something useful.

    http://cwiki.apache.org/MAHOUT/clusteringyourdata.html

I successfully ran
the org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job test, which
output data as text (to the console at least), so I tried ripping the
printResults() methods from that class and putting them
in org.apache.mahout.clustering.dirichlet.DirichletJob, but to no avail.

Can someone help?

Also, the command line job asks for the prototypeSize (-s param). When I
converted my Lucene index to a vector file, the output said it created 11
vectors, but when I specified that value for prototypeSize the job failed,
saying it found 1793 vectors. Changing the value I specify to 1793 works,
but now I wonder why I need to specify it at all if the job can figure it
out. Could it not be optional?

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
That's not what I get from the paper. Certainly, the cluster center is 
the first representative point. But the paper talks about subsequently 
iterating through the clustered points to find the farthest point from 
the previously-selected representative points (RPs) and then adding that 
as another representative point. After a few such iterations, a set of 
RPs is developed for each cluster that defines the extreme points 
observed within the cluster. This is especially useful for non-spherical 
clusters, such as those returned by mean shift and Dirichlet asymmetric 
models. Then, in the final stage, the RPs in each cluster are compared 
and the closest RPs are used to compute CDbw. The final calculation can 
be done in memory since the number of clusters and RPs is well-bounded 
by then.
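
To make the selection loop concrete, here is a rough in-memory sketch of
what I have in mind, with plain double[] points and Euclidean distance
standing in for our Vector and DistanceMeasure classes (all names are
illustrative only, not actual Mahout code):

import java.util.ArrayList;
import java.util.List;

// In-memory sketch of farthest-point representative point (RP) selection for a
// single cluster. Plain double[] points and Euclidean distance stand in for
// Mahout's Vector and DistanceMeasure; all names are illustrative only.
public class RepresentativePointSketch {

  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // Start with the cluster center as the first RP, then repeatedly add the
  // clustered point whose distance to its nearest already-chosen RP is largest.
  static List<double[]> selectRepresentativePoints(double[] center,
                                                   List<double[]> clusteredPoints,
                                                   int numRps) {
    List<double[]> rps = new ArrayList<double[]>();
    rps.add(center);
    while (rps.size() < numRps) {
      double[] farthest = null;
      double farthestDistance = -1.0;
      for (double[] point : clusteredPoints) {
        double nearestRpDistance = Double.MAX_VALUE;
        for (double[] rp : rps) {
          nearestRpDistance = Math.min(nearestRpDistance, distance(point, rp));
        }
        if (nearestRpDistance > farthestDistance) {
          farthestDistance = nearestRpDistance;
          farthest = point;
        }
      }
      if (farthest == null) {
        break; // no points left to choose from
      }
      rps.add(farthest);
    }
    return rps;
  }
}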

I get that each RP iteration takes place over all of the clustered 
points and would require a new MR job for each iteration. I imagine 
initializing the mappers and reducers with the set of clusters and their 
RPs. Then each mapper processes a subset of all clustered points,
finally outputting the farthest point it has seen for each cluster. The
reducer gets this information and selects the RP that is absolutely the 
most distant, outputting it with the clusters+RPs for the next 
iteration. This is a lot like the way Dirichlet works now, outputting 
state to be used for the next iteration over the entire point set. We 
would need to allow a DistanceMeasure to be specified for this phase.
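
In code, one such iteration might look roughly like this (plain Java
standing in for the mapper and reducer bodies; the Hadoop plumbing, the
distance function and all names are illustrative only):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of one representative-point iteration expressed as map/reduce logic,
// without the Hadoop plumbing. Each "mapper" sees one split of the clustered
// points plus the current clusters-with-RPs; the "reducer" keeps the single
// farthest candidate per cluster. All names and types are illustrative only.
public class RpIterationSketch {

  // A candidate emitted by a mapper: the point and its distance to the nearest RP.
  static class Candidate {
    final double[] point;
    final double distanceToNearestRp;

    Candidate(double[] point, double distanceToNearestRp) {
      this.point = point;
      this.distanceToNearestRp = distanceToNearestRp;
    }
  }

  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // Mapper side: for one split of clustered points (keyed by cluster id), keep the
  // farthest candidate seen for each cluster.
  static Map<Integer, Candidate> mapSplit(Map<Integer, List<double[]>> split,
                                          Map<Integer, List<double[]>> clusterRps) {
    Map<Integer, Candidate> farthestPerCluster = new HashMap<Integer, Candidate>();
    for (Map.Entry<Integer, List<double[]>> entry : split.entrySet()) {
      int clusterId = entry.getKey();
      List<double[]> rps = clusterRps.get(clusterId);
      for (double[] point : entry.getValue()) {
        double nearest = Double.MAX_VALUE;
        for (double[] rp : rps) {
          nearest = Math.min(nearest, distance(point, rp));
        }
        Candidate best = farthestPerCluster.get(clusterId);
        if (best == null || nearest > best.distanceToNearestRp) {
          farthestPerCluster.put(clusterId, new Candidate(point, nearest));
        }
      }
    }
    return farthestPerCluster;
  }

  // Reducer side: across all mapper outputs for one cluster, pick the absolutely
  // farthest candidate and append it to that cluster's RPs for the next iteration.
  static void reduce(int clusterId, List<Candidate> candidates,
                     Map<Integer, List<double[]>> clusterRps) {
    Candidate best = null;
    for (Candidate candidate : candidates) {
      if (best == null || candidate.distanceToNearestRp > best.distanceToNearestRp) {
        best = candidate;
      }
    }
    if (best != null) {
      clusterRps.get(clusterId).add(best.point);
    }
  }
}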

Currently, only canopy and kMeans actually produce their clustered 
points. Dirichlet points could be clustered by assigning each point to 
the model with the largest pdf (or even to more than one based upon a 
user-settable pdf threshold). Fuzzy kMeans would need to make similar 
assignments. MeanShift point ids are currently retained in its cluster 
state but there is no step to build clustered points like canopy and 
kMeans do. Some work would be needed here too, as we need a uniform 
representation for clustered points.
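
For the Dirichlet assignment step, something like this sketch would do
(Model is an illustrative stand-in for the actual model classes, and the
threshold is the user-settable option mentioned above):

import java.util.ArrayList;
import java.util.List;

// Sketch of assigning clustered points to Dirichlet models after the fact: each
// point goes to the model with the largest pdf, and optionally to any other model
// whose pdf exceeds a user-settable threshold. Model is an illustrative interface,
// not Mahout's actual class.
public class DirichletAssignmentSketch {

  interface Model {
    double pdf(double[] point);
  }

  // Returns the indices of the models the point is assigned to.
  static List<Integer> assign(double[] point, List<Model> models, double pdfThreshold) {
    List<Integer> assignments = new ArrayList<Integer>();
    int best = -1;
    double bestPdf = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < models.size(); i++) {
      double p = models.get(i).pdf(point);
      if (p > bestPdf) {
        bestPdf = p;
        best = i;
      }
      if (p >= pdfThreshold) {
        assignments.add(i); // soft assignment to every model above the threshold
      }
    }
    if (best >= 0 && !assignments.contains(best)) {
      assignments.add(best); // always include the most likely model
    }
    return assignments;
  }
}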

Finally, I'd like to review the output file naming conventions across 
all the clustering algorithms and converge on a single nomenclature that 
is common across all jobs.

Robin Anil wrote:
> The cluster center itself is a representative point. One pass over the data
> will get us points close enough to it. Or, exhaustively, we could just add
> it in the KMeans Mapper and update a counter, maybe?
>
> Robin
>
>   


Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Robin Anil <ro...@gmail.com>.
The cluster center itself is a representative point. One pass over the data
will get us points close enough to it. Or, exhaustively, we could just add it
in the KMeans Mapper and update a counter, maybe?

Robin

On Fri, Apr 9, 2010 at 4:13 AM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> Looking at the paper it doesn't seem to require MR for the final CDbw
> calculation, right? For each cluster we only need to compare one of its
> points with one point in each other cluster. With small numbers of
> representative points per cluster that can be done easily in memory. I'd
> love to see the code you have for computing representative points.
>
> Jeff
>
>
>
> Robin Anil wrote:
>
>> On Wed, Apr 7, 2010 at 11:50 PM, Jeff Eastman <jdog@windwardsolutions.com
>> >wrote:
>>
>>
>>
>>> Hi Robin,
>>>
>>> Interesting paper. I'm beginning to see how to MR the representative
>>> point
>>> selection already. The rest will hopefully become clearer with more
>>> study.
>>> Lots of MR jobs are needed to:
>>>
>>>
>>
>>
>>
>>
>>
>>> a) get the data into Vectors, We have something for text, missing for
>>> other
>>> formats
>>>
>>>
>>
>>
>>
>>
>>
>>> b) iterate (e.g. kmeans) over the data to produce a set of clusters, Done
>>>
>>>
>>
>>
>>
>>
>>
>>> c) cluster the data, Done
>>>
>>>
>>
>>
>>
>>
>>
>>> d) iterate over the clustered data to derive representative points for
>>> each
>>> cluster, and finally Done ;)
>>>
>>>
>>
>>
>>
>>
>>
>>> e) produce the CDbw.- TODO
>>>
>>>
>>
>>
>>
>>
>>
>>
>>> And, of course all of this is again iterated with different values for
>>> the
>>> clustering algorithm's parameters. Should keep the lights on at PG&E
>>> producing power for the server farms.
>>>
>>>
>>>
>>> Robin Anil wrote:
>>>
>>>
>>>
>>>> Hi Jeff,
>>>>           This is a good paper with a simple measure of cluster quality
>>>> based on intra-cluster density and inter-cluster separation. It's
>>>> pretty easy to compute; we need to make it a map/reduce job:
>>>>
>>>>
>>>> http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
>>>> Robin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Looking at the paper it doesn't seem to require MR for the final CDbw 
calculation, right? For each cluster we only need to compare one of its 
points with one point in each other cluster. With small numbers of 
representative points per cluster that can be done easily in memory. I'd 
love to see the code you have for computing representative points.
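
In other words, once the RPs are in memory the comparison is just a small
double loop, something like this sketch (plain double[] points standing in
for Vectors; names are illustrative):

// Sketch of the in-memory comparison: the minimum distance between any
// representative point of cluster A and any representative point of cluster B,
// i.e. the closest-RP pair that the CDbw computation works from.
public class ClosestRpSketch {

  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  static double closestRepresentativePoints(double[][] rpsA, double[][] rpsB) {
    double min = Double.MAX_VALUE;
    for (double[] a : rpsA) {
      for (double[] b : rpsB) {
        min = Math.min(min, distance(a, b));
      }
    }
    return min;
  }
}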

Jeff


Robin Anil wrote:
> On Wed, Apr 7, 2010 at 11:50 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:
>
>   
>> Hi Robin,
>>
>> Interesting paper. I'm beginning to see how to MR the representative point
>> selection already. The rest will hopefully become clearer with more study.
>> Lots of MR jobs are needed to:
>>     
>
>
>
>   
>> a) get the data into Vectors, We have something for text, missing for other
>> formats
>>     
>
>
>
>   
>> b) iterate (e.g. kmeans) over the data to produce a set of clusters, Done
>>     
>
>
>
>   
>> c) cluster the data, Done
>>     
>
>
>
>   
>> d) iterate over the clustered data to derive representative points for each
>> cluster, and finally Done ;)
>>     
>
>
>
>   
>> e) produce the CDbw.- TODO
>>     
>
>
>
>
>   
>> And, of course all of this is again iterated with different values for the
>> clustering algorithm's parameters. Should keep the lights on at PG&E
>> producing power for the server farms.
>>
>>
>>
>> Robin Anil wrote:
>>
>>     
>>> Hi Jeff,
>>>            This is a good paper with a simple measure of cluster quality
>>> based on intra-cluster density and inter-cluster separation. It's
>>> pretty easy to compute; we need to make it a map/reduce job:
>>>
>>> http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
>>> Robin
>>>
>>>
>>>
>>>
>>>       
>>     
>
>   


Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Robin Anil <ro...@gmail.com>.
On Wed, Apr 7, 2010 at 11:50 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> Hi Robin,
>
> Interesting paper. I'm beginning to see how to MR the representative point
> selection already. The rest will hopefully become clearer with more study.
> Lots of MR jobs are needed to:



> a) get the data into Vectors, We have something for text, missing for other
> formats



> b) iterate (e.g. kmeans) over the data to produce a set of clusters, Done



> c) cluster the data, Done



> d) iterate over the clustered data to derive representative points for each
> cluster, and finally Done ;)



> e) produce the CDbw.- TODO




> And, of course all of this is again iterated with different values for the
> clustering algorithm's parameters. Should keep the lights on at PG&E
> producing power for the server farms.
>
>
>
> Robin Anil wrote:
>
>> Hi Jeff,
>>            This is a good paper with a simple measure of cluster quality
>> based on intra-cluster density and inter-cluster separation. It's
>> pretty easy to compute; we need to make it a map/reduce job:
>>
>> http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
>> Robin
>>
>>
>>
>>
>
>

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Robin,

Interesting paper. I'm beginning to see how to MR the representative 
point selection already. The rest will hopefully become clearer with 
more study. Lots of MR jobs are needed to: a) get the data into Vectors, 
b) iterate (e.g. kmeans) over the data to produce a set of clusters, c) 
cluster the data, d) iterate over the clustered data to derive 
representative points for each cluster, and finally e) produce the CDbw. 
And, of course all of this is again iterated with different values for 
the clustering algorithm's parameters. Should keep the lights on at PG&E 
producing power for the server farms.


Robin Anil wrote:
> Hi Jeff,
>             This is a good paper with a simple measure of cluster quality
> based on intra-cluster density and inter-cluster separation. It's pretty
> easy to compute; we need to make it a map/reduce job:
> http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
> Robin
>
>
>   


Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Robin Anil <ro...@gmail.com>.
Hi Jeff,
            This is a good paper with a simple measure of cluster quality
based on intra-cluster density and inter-cluster separation. It's pretty easy
to compute; we need to make it a map/reduce job:
http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
Robin


On Wed, Apr 7, 2010 at 7:03 AM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> Hi Robin,
>
> Great! I've got the refactoring changes for consolidating all the various
> cluster types under a Cluster interface (formerly Printable but now with id,
> numPoints and a center added). Dirichlet models still don't yet have
> meaningful ids implemented but they all do (so far anyway) have a notion of
> "numPoints" and a "center". I'm working on tests tomorrow to make sure the
> ClusterDumper actually works with Dirichlet clusters then I will commit
> that. Wednesday or Thursday most likely.
>
> BTW, I changed my mind about foisting off the old Printable interface on
> Vectors (but am still open to the idea if somebody actually working in math
> thinks it is worth doing). All the new Clusters use the vector formatting
> done in ClusterBase.
>
> What I'd really like is feedback from ClusterDumper users on what is
> working and what is needed to address MAHOUT-236. That includes you, right?
>
> Jeff
>
> PS: Ted, you expressed some doubts about the value of consolidating
> Dirichlet clusters with the others. So far it seems to be a reasonable fit
> but I'm doing the engineering on a tiny subset of simple models without
> enough theoretical insight to see any pitfalls ahead. Is there a
> "DistanceMeasure-like" discussion that might provide a firmer underpinning
> for this work?
>
>
>
>
> Robin Anil wrote:
>
>> No one yet. I am willing to help in case you need an extra pair of hands
>> on this one.
>>
>> Robin
>>
>>
>

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Ted Dunning <te...@gmail.com>.
If it fits, then it is great to do.

On Tue, Apr 6, 2010 at 6:33 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> PS: Ted, you expressed some doubts about the value of consolidating
> Dirichlet clusters with the others. So far it seems to be a reasonable fit
> but I'm doing the engineering on a tiny subset of simple models without
> enough theoretical insight to see any pitfalls ahead. Is there a
> "DistanceMeasure-like" discussion that might provide a firmer underpinning
> for this work?
>

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Robin,

Great! I've got the refactoring changes for consolidating all the 
various cluster types under a Cluster interface (formerly Printable but 
now with id, numPoints and a center added). Dirichlet models don't yet
have meaningful ids implemented, but they all do (so far anyway) have
a notion of "numPoints" and a "center". I'm working on tests tomorrow to 
make sure the ClusterDumper actually works with Dirichlet clusters then 
I will commit that. Wednesday or Thursday most likely.
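
Roughly, the interface shape I'm converging on looks like this (an
illustrative sketch only; the committed version would use Vector for the
center and may differ in details):

// Illustrative sketch of the consolidated Cluster interface being described
// (formerly Printable, now with an id, a point count and a center). The real
// interface will use Mahout's Vector for the center and may differ in details.
public interface Cluster {
  int getId();          // a cluster/model identifier
  int getNumPoints();   // how many points the cluster captured
  double[] getCenter(); // stand-in for a Vector-valued center
}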

BTW, I changed my mind about foisting off the old Printable interface on 
Vectors (but am still open to the idea if somebody actually working in 
math thinks it is worth doing). All the new Clusters use the vector 
formatting done in ClusterBase.

What I'd really like is feedback from ClusterDumper users on what is 
working and what is needed to address MAHOUT-236. That includes you, right?

Jeff

PS: Ted, you expressed some doubts about the value of consolidating 
Dirichlet clusters with the others. So far it seems to be a reasonable 
fit but I'm doing the engineering on a tiny subset of simple models 
without enough theoretical insight to see any pitfalls ahead. Is there a 
"DistanceMeasure-like" discussion that might provide a firmer 
underpinning for this work?



Robin Anil wrote:
> No one yet. I am willing to help in case you need an extra pair of hands on
> this one.
>
> Robin
>   

Re: MAHOUT-236 Cluster Evaluation Tools?

Posted by Robin Anil <ro...@gmail.com>.
No one yet. I am willing to help in case you need an extra pair of hands on
this one.

Robin


On Wed, Apr 7, 2010 at 3:40 AM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> Is anybody working on MAHOUT-236? To me it looks like the next logical step
> beyond generalizing the cluster dumper: improving on its summaries
>
> Jeff Eastman wrote:
>
>> Completing the ClusterDumper jira will allow for visual inspection of the
>> Dirichlet models and extracting some useful information thereof; arguably
>> not too useful with 1793-element vectors but this is also true of kmeans
>> clusters with 1793-element center vectors. With no terminating conditions,
>> selecting the particular iteration to inspect is also an issue unique to
>> Dirichlet. MAHOUT-236 has been around for a while and, as Jake notes below,
>> is really needed.
>>
>>
>

MAHOUT-236 Cluster Evaluation Tools?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Is anybody working on MAHOUT-236? To me it looks like the next logical
step beyond generalizing the cluster dumper: improving on its summaries.

Jeff Eastman wrote:
> Completing the ClusterDumper jira will allow for visual inspection of 
> the Dirichlet models and extracting some useful information thereof; 
> arguably not too useful with 1793-element vectors but this is also 
> true of kmeans clusters with 1793-element center vectors. With no 
> terminating conditions, selecting the particular iteration to inspect 
> is also an issue unique to Dirichlet. MAHOUT-236 has been around for a 
> while and, as Jake notes below, is really needed.
>


Re: clustering your data with dirichlet issue

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Completing the ClusterDumper jira will allow for visual inspection of 
the Dirichlet models and extracting some useful information thereof; 
arguably not too useful with 1793-element vectors but this is also true 
of kmeans clusters with 1793-element center vectors. With no terminating 
conditions, selecting the particular iteration to inspect is also an 
issue unique to Dirichlet. MAHOUT-236 has been around for a while and, 
as Jake notes below, is really needed.




Ted Dunning wrote:
> This isn't far from true.  I was just thinking something along the same
> lines, but phrased a bit differently.
>
> My thought was that if the concept and output are sooo different, will users
> be able to use it even if the dumper is made to work well?
>
>
> On Tue, Apr 6, 2010 at 10:27 AM, Jake Mannix <ja...@gmail.com> wrote:
>
>   
>>  Without this final step, this seems very much like an unfinished feature,
>> to the point of being unusable.
>>     
>
>   


Re: clustering your data with dirichlet issue

Posted by Jake Mannix <ja...@gmail.com>.
What we really need is a nice utility to take clustered output and maybe
label all of the vectors in the training set (and new vectors, if it's
either a generative model or one which allows "folding in") with some labels
in a Vector wrapper class, and maybe some sort of statistics generating
utility, which prints out general data about the clustering (number of
points per cluster, how wide they are, what the centroids are or other stuff
like that).

This is really something true of all of the clustering classes / jobs, not
just Dirichlet.
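
Something roughly like this for the statistics part (just a sketch, with
plain arrays standing in for the labeled Vector wrapper I'm imagining; all
names are illustrative):

import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Rough sketch of the statistics utility: given points already labeled with a
// cluster id, print per-cluster size, centroid, and average distance to the
// centroid as a crude "width". Plain double[] points stand in for the labeled
// Vector wrapper; all names are illustrative.
public class ClusterStatsSketch {

  static void printStats(Map<Integer, List<double[]>> pointsByCluster) {
    for (Map.Entry<Integer, List<double[]>> entry : pointsByCluster.entrySet()) {
      List<double[]> points = entry.getValue();
      int dim = points.get(0).length;
      double[] centroid = new double[dim];
      for (double[] point : points) {
        for (int i = 0; i < dim; i++) {
          centroid[i] += point[i];
        }
      }
      for (int i = 0; i < dim; i++) {
        centroid[i] /= points.size();
      }
      double width = 0.0; // mean Euclidean distance to the centroid
      for (double[] point : points) {
        double sum = 0.0;
        for (int i = 0; i < dim; i++) {
          double d = point[i] - centroid[i];
          sum += d * d;
        }
        width += Math.sqrt(sum);
      }
      width /= points.size();
      System.out.println("cluster " + entry.getKey() + ": numPoints=" + points.size()
          + " width=" + width + " centroid=" + Arrays.toString(centroid));
    }
  }
}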

  -jake

On Tue, Apr 6, 2010 at 10:30 AM, Ted Dunning <te...@gmail.com> wrote:

> This isn't far from true.  I was just thinking something along the same
> lines, but phrased a bit differently.
>
> My thought was that if the concept and output are sooo different, will users
> be able to use it even if the dumper is made to work well?
>
>
> On Tue, Apr 6, 2010 at 10:27 AM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> >
> >  Without this final step, this seems very much like an unfinished
> feature,
> > to the point of being unusable.
>

Re: clustering your data with dirichlet issue

Posted by Ted Dunning <te...@gmail.com>.
This isn't far from true.  I was just thinking something along the same
lines, but phrased a bit differently.

My thought was that if the concept and output are sooo different, will users
be able to use it even if the dumper is made to work well?


On Tue, Apr 6, 2010 at 10:27 AM, Jake Mannix <ja...@gmail.com> wrote:

>
>  Without this final step, this seems very much like an unfinished feature,
> to the point of being unusable.

Re: clustering your data with dirichlet issue

Posted by Jake Mannix <ja...@gmail.com>.
Hey Jeff,

  Excuse my ignorance of the Dirichlet clustering process, but in reading
your email explaining this, I'm struck with the question: what is a user
supposed to do at all currently with this output?  If the ClusterDumper
can't spit it out until MAHOUT-270 is in, and it's in a format which is
Dirichlet-specific... what do we expect people to do with it once they've
run this?

  Without this final step, this seems very much like an unfinished feature,
to the
point of being unusable.

  -jake

On Tue, Apr 6, 2010 at 10:14 AM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> Toby Doig wrote:
>
>> I've run dirichlet commandline and now have an output folder with some
>> state-0, state-1, ... state-5 folders which each contain part-00000 and
>> .part-00000.crc files. However the  ClusteringYourData wiki page's
>> Retrieving the Output section just says TODO. I don't know how to turn
>> those
>> part files into something useful.
>>
>>    http://cwiki.apache.org/MAHOUT/clusteringyourdata.html
>>
>> I successfully ran
>> the org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job test which
>> outputted data as text (to console at least) so I tried ripping the
>> printResults() methods from that class and putting them
>> in org.apache.mahout.clustering.dirichlet.DirichletJob but to no avail.
>>
>> Can someone help?
>>
>> Also, the command line job asks for the prototypeSize (-s param). When I
>> converted my Lucene index to a vector file, the output said it created 11
>> vectors, but when I specified that value for prototypeSize the job failed,
>> saying it found 1793 vectors. Changing the value I specify to 1793 works,
>> but now I wonder why I need to specify it at all if the job can figure it
>> out. Could it not be optional?
>>
>>
>>
> Hi Toby,
>
> Each of the state-i directories contains a sequence file of the model
> states at the end of the i-th iteration. Since Dirichlet does not have a
> convergence criterion it will run for as many iterations as you select.
> Interpreting the results is also challenged by the fact that points are not
> assigned uniquely to a model - as in kmeans - or even with a probability -
> as in fuzzy kmeans. Each model does retain the number of points that it
> captured in that iteration - not the points themselves - so it is possible
> to back-fit the points to see which were the most likely to be captured by
> using the model's pdf() function and taking the top n points. Of course,
> that won't scale but check out TestL1ModelClustering in utils/ for some code
> that I used.
>
> The ClusterDumper is not able to dump the Dirichlet clusters though there
> is an issue to do this (MAHOUT-270) which is not yet completed. I'm working
> on it though, and you are welcome to make suggestions. Currently I'm trying
> to refactor the term priorities and other stuff in ClusterDumper to work
> with the Printable interface rather than relying upon ClusterBase.
>
> The prototype and prototypeSize arguments give you a way to specify the
> class and size of the Vectors which underlie the existing models. One could
> probably glean this information by inspecting the first data element
> presented to the algorithm at initialization time. There is at this time no
> connection between the Lucene index to Vector transformation in utils and
> the Dirichlet job in core/ and no obvious way to introduce one given the
> dependencies.
>
> Code suggestions and patches to improve this all are of course welcome,
> Jeff
>

Re: clustering your data with dirichlet issue

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Toby Doig wrote:
> I've run dirichlet commandline and now have an output folder with some
> state-0, state-1, ... state-5 folders which each contain part-00000 and
> .part-00000.crc files. However the  ClusteringYourData wiki page's
> Retrieving the Output section just says TODO. I don't know how to turn those
> part files into something useful.
>
>     http://cwiki.apache.org/MAHOUT/clusteringyourdata.html
>
> I successfully ran
> the org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job test which
> outputted data as text (to console at least) so I tried ripping the
> printResults() methods from that class and putting them
> in org.apache.mahout.clustering.dirichlet.DirichletJob but to no avail.
>
> Can someone help?
>
> Also, the command line job asks for the prototypeSize (-s param). When I
> converted my Lucene index to a vector file, the output said it created 11
> vectors, but when I specified that value for prototypeSize the job failed,
> saying it found 1793 vectors. Changing the value I specify to 1793 works,
> but now I wonder why I need to specify it at all if the job can figure it
> out. Could it not be optional?
>
>   
Hi Toby,

Each of the state-i directories contains a sequence file of the model 
states at the end of the i-th iteration. Since Dirichlet does not have a 
convergence criterion it will run for as many iterations as you select.
Interpreting the results is also challenged by the fact that points are 
not assigned uniquely to a model - as in kmeans - or even with a 
probability - as in fuzzy kmeans. Each model does retain the number of 
points that it captured in that iteration - not the points themselves - 
so it is possible to back-fit the points to see which were the most 
likely to be captured by using the model's pdf() function and taking the 
top n points. Of course, that won't scale but check out 
TestL1ModelClustering in utils/ for some code that I used.
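
The back-fit itself is little more than this sketch (illustrative only, not
the actual TestL1ModelClustering code; Model stands in for the real model
classes and their pdf()):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch of back-fitting points to one Dirichlet model: score every point with
// the model's pdf() and keep the top n. Model is an illustrative interface, not
// the actual Mahout class; see TestL1ModelClustering in utils/ for real code.
public class BackFitSketch {

  interface Model {
    double pdf(double[] point);
  }

  static List<double[]> topNForModel(final Model model, List<double[]> points, int n) {
    List<double[]> sorted = new ArrayList<double[]>(points);
    Collections.sort(sorted, new Comparator<double[]>() {
      public int compare(double[] a, double[] b) {
        return Double.compare(model.pdf(b), model.pdf(a)); // highest pdf first
      }
    });
    return sorted.subList(0, Math.min(n, sorted.size()));
  }
}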

The ClusterDumper is not able to dump the Dirichlet clusters though 
there is an issue to do this (MAHOUT-270) which is not yet completed. 
I'm working on it though, and you are welcome to make suggestions. 
Currently I'm trying to refactor the term priorities and other stuff in 
ClusterDumper to work with the Printable interface rather than relying 
upon ClusterBase.

The prototype and prototypeSize arguments give you a way to specify the 
class and size of the Vectors which underlie the existing models. One
could probably glean this information by inspecting the first data 
element presented to the algorithm at initialization time. There is at 
this time no connection between the Lucene index to Vector 
transformation in utils and the Dirichlet job in core/ and no obvious 
way to introduce one given the dependencies.
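
If we did want to infer it, the gleaning step would look roughly like this
(a sketch only; the final inspection is left as a comment since it depends
on the concrete vector class the file actually holds):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch of gleaning the prototype class (and, with a cast, its size) from the
// first record of the input SequenceFile instead of requiring -s on the command
// line. The final inspection is left as a comment because it depends on the
// concrete vector class the file holds.
public class PrototypeSizeSketch {

  static void inspectFirstElement(Configuration conf, Path input) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    try {
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      if (reader.next(key, value)) {
        System.out.println("prototype class: " + value.getClass().getName());
        // Here one would cast 'value' to the concrete vector class and read its
        // cardinality to use as prototypeSize, rather than asking the user for -s.
      }
    } finally {
      reader.close();
    }
  }
}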

Code suggestions and patches to improve this all are of course welcome,
Jeff