Posted to user@mahout.apache.org by Bogdan Vatkov <bo...@gmail.com> on 2010/01/13 00:46:13 UTC

CardinalityException in DirichletDriver

What could be the reason for this CardinalityException?

10/01/13 01:41:09 INFO clustering.SolrToMahoutDriver: Wrote: 174 vectors
10/01/13 01:41:09 INFO clustering.SolrToMahoutDriver: Dictionary Output file: /store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/dictionary.txt
10/01/13 01:41:11 INFO dirichlet.DirichletDriver: Iteration 0
10/01/13 01:41:11 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
10/01/13 01:41:11 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
10/01/13 01:41:11 INFO mapred.FileInputFormat: Total input paths to process : 1
10/01/13 01:41:11 INFO mapred.JobClient: Running job: job_local_0001
10/01/13 01:41:11 INFO mapred.FileInputFormat: Total input paths to process : 1
10/01/13 01:41:11 INFO compress.CodecPool: Got brand-new decompressor
10/01/13 01:41:11 INFO mapred.MapTask: numReduceTasks: 1
10/01/13 01:41:11 INFO mapred.MapTask: io.sort.mb = 100
10/01/13 01:41:12 INFO mapred.MapTask: data buffer = 79691776/99614720
10/01/13 01:41:12 INFO mapred.MapTask: record buffer = 262144/327680
10/01/13 01:41:12 WARN mapred.LocalJobRunner: job_local_0001
org.apache.mahout.matrix.CardinalityException
at org.apache.mahout.matrix.AbstractVector.dot(AbstractVector.java:92)
at org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:111)
at org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:28)
at org.apache.mahout.clustering.dirichlet.DirichletState.adjustedProbability(DirichletState.java:129)
at org.apache.mahout.clustering.dirichlet.DirichletMapper.normalizedProbabilities(DirichletMapper.java:111)
at org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:47)
at org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:38)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
10/01/13 01:41:12 INFO mapred.JobClient:  map 0% reduce 0%
10/01/13 01:41:12 INFO mapred.JobClient: Job complete: job_local_0001
10/01/13 01:41:12 INFO mapred.JobClient: Counters: 0
10/01/13 01:41:12 WARN dirichlet.DirichletDriver: java.io.IOException: Job failed!
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.mahout.clustering.dirichlet.DirichletDriver.runIteration(DirichletDriver.java:214)
at org.apache.mahout.clustering.dirichlet.DirichletDriver.runJob(DirichletDriver.java:139)
at org.apache.mahout.clustering.dirichlet.DirichletDriver.main(DirichletDriver.java:109)
at org.bogdan.clustering.mbeans.Clusters.doClustering(Clusters.java:244)
at org.bogdan.clustering.mbeans.Clusters.access$0(Clusters.java:175)
at org.bogdan.clustering.mbeans.Clusters$1.run(Clusters.java:148)
at java.lang.Thread.run(Thread.java:619)

Re: CardinalityException in DirichletDriver

Posted by Ted Dunning <te...@gmail.com>.
The idea of dot products and most vector implementations come from the
linear algebra world.  There, the key concept is vector space with fixed
number of dimensions (and dot products are only simple sums of products for
certain choices of coordinate system).  Essentially all implementations have
inherited this notion of dimension and consider it a serious problem when
dot'ing vectors of different dimension.

This is the same consideration that causes the multiplication of
non-conformable matrices to be considered an error.  You could think of
matrices as being unbounded in dimension, in which case all matrices are
conformable, but this is definitely not the traditional notion nor
implementation.

It is a nice side effect of our implementation that you can define a vector
size as MAX_INT, but that doesn't change the notion of conformability.

In the Dirichlet clustering, the NormalModel works very much in the linear
algebra mind-set.  Data vectors are linear algebra vectors, and the normal
distributions use them as their domain.  The internal parameters of the
normal distribution are themselves vectors or matrices, so the notion of
conformability is pretty important as a type check.
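
To make that concrete, here is a minimal sketch of the kind of conformability
check being discussed (illustrative only; Mahout's AbstractVector.dot does the
analogous check and throws CardinalityException):

  // Dot product is only defined for vectors of equal cardinality; a
  // mismatch is treated as a programming error rather than padded with zeros.
  public static double dot(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "Cardinality mismatch: " + a.length + " vs " + b.length);
    }
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }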

On Wed, Jan 13, 2010 at 12:18 PM, Sean Owen <sr...@gmail.com> wrote:

> Can I ask a dumb question -- why is this? conceptually vectors don't
> have a maximum size. ...
>
> Certainly particular implementations have a notion of maximum size:
> DenseVector. But I'd think it's an implementation-specific possible
> error case.
>
>

Re: CardinalityException in DirichletDriver

Posted by Jake Mannix <ja...@gmail.com>.
On Wed, Jan 13, 2010 at 1:59 PM, Ted Dunning <te...@gmail.com> wrote:

> Unless we go all the way down that road and make SparseMatrix live with the
> same trick, I would be against doing this by default.
>

Certainly - we need to be consistent whichever we do - if we decide to have
our "default vector space" be R^{\inf} instead of R^{0}, we do it for
everything which knows about vector spaces.

  -jake

Re: CardinalityException in DirichletDriver

Posted by Ted Dunning <te...@gmail.com>.
Unless we go all the way down that road and make SparseMatrix live with the
same trick, I would be against doing this by default.

On Wed, Jan 13, 2010 at 1:27 PM, Jake Mannix <ja...@gmail.com> wrote:

> You can certainly "turn it off" by making
> all of your (Sparse!) Vectors be "infinite" dimensional from the start.  I
> imagine we could do the reverse, and have it default to infinite
> dimensional and
> only when you construct them with explicit dimensions would you instead
> start
> doing this checking.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: CardinalityException in DirichletDriver

Posted by Jake Mannix <ja...@gmail.com>.
On Wed, Jan 13, 2010 at 12:18 PM, Sean Owen <sr...@gmail.com> wrote:

> Can I ask a dumb question -- why is this? conceptually vectors don't
> have a maximum size. They just have values at some dimensions and 0
> elsewhere. Dotting two vectors is always well-defined.
>

Conceptually vectors belong to some *fixed* vector space, which by definition
has some fixed dimension (or is infinite dimensional).  Practically speaking,
every finite dimensional vector space of the same dimension is isomorphic,
and they all live as subspaces of an infinite dimensional one, so you can
map them there (conceptually) and dot them there.  But mathematically
speaking it doesn't make sense to take the dot product of vectors from
different spaces.

Programmatically, to avoid programmer error, Vector classes I've written in
the past were sometimes parametrized with a marker interface (interface
Vector<T extends VectorSpace>), forcing you to only do vector operations
between vectors of the same space, which gives you compile-time checking
that you're not doing something silly (taking a vector which was projected
down to 100 dimensions and dotting it with a vector which lives in your
original 50,000 dimensional "term space", or adding a word-bag vector to a
document-bag vector, etc...).

I eventually found that such checks were great, and did help, but managing
the delicacies of writing APIs which were properly covariant w.r.t. the
VectorSpace typing (for collections of Vectors, and APIs and methods which
took and returned collections of subclasses of vectors, etc. etc...) was
more pain than it was worth.

A nice intermediate ground is getting at least runtime checking that you are
not messing up (which is what we have here in Mahout, and what the commons
math folk have, and most everyone else).  You can certainly "turn it off" by
making all of your (Sparse!) Vectors be "infinite" dimensional from the
start.  I imagine we could do the reverse, and have it default to infinite
dimensional, and only start doing this checking when you construct vectors
with explicit dimensions.

  -jake


> Certainly particular implementations have a notion of maximum size:
> DenseVector. But I'd think it's an implementation-specific possible
> error case.

Re: CardinalityException in DirichletDriver

Posted by Sean Owen <sr...@gmail.com>.
Can I ask a dumb question -- why is this? conceptually vectors don't
have a maximum size. They just have values at some dimensions and 0
elsewhere. Dotting two vectors is always well-defined.

Certainly particular implementations have a notion of maximum size:
DenseVector. But I'd think it's an implementation-specific possible
error case.

On Wed, Jan 13, 2010 at 8:13 PM, Ted Dunning <te...@gmail.com> wrote:
> dot product is a vector operation that is the sum of products of
> corresponding elements of the two vectors being operated on.  If these
> vectors don't have the same length, then it is an error.

Re: CardinalityException in DirichletDriver

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Because of its non-deterministic nature, Dirichlet is darn hard to test. 
The 2-d tests offer the option of plotting out the points and the models 
and eyeballing the result 
(http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html), but 
more rigorous testing, and testing on higher-order problems in general, is 
needed. There was a student on this list last summer who offered some 
pointed suggestions, but he did not follow up and I've been under water in a startup.


Ted Dunning wrote:
> Because the unit tests were 2-dimensional examples.
>
> On Wed, Jan 13, 2010 at 3:34 PM, Jake Mannix <ja...@gmail.com> wrote:
>
>   
>> Ack, this is bad - why have we not caught this in unit tests?
>>     
>
>
>
>
>   


Re: CardinalityException in DirichletDriver

Posted by Ted Dunning <te...@gmail.com>.
Because the unit tests were 2-dimensional examples.

On Wed, Jan 13, 2010 at 3:34 PM, Jake Mannix <ja...@gmail.com> wrote:

> Ack, this is bad - why have we not caught this in unit tests?




-- 
Ted Dunning, CTO
DeepDyve

Re: CardinalityException in DirichletDriver

Posted by Jake Mannix <ja...@gmail.com>.
On Wed, Jan 13, 2010 at 3:23 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> The NormalModelDistribution seems to still think all the data vectors are
> size=2.  In SampleFromPrior, it is creating models with that size.
> Subsequently, when you calculate the pdf with your data value (x) the sizes
> are incompatible. Suggest changing 'DenseVector(2)' to 'DenseVector(n)',
> where n is your data cardinality. Please also look at the rest of the math
> in DenseVector with suspicion. AFAIK, you are the first person to try to
> use Dirichlet.


Ack, this is bad - why have we not caught this in unit tests?

  -jake
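
For reference, a minimal sketch of the kind of change Jeff is suggesting -
letting the prior sampling take the data cardinality (or a prototype vector)
instead of hard-coding DenseVector(2). The extra parameter is hypothetical
here, not the committed API:

  // Hypothetical variant of sampleFromPrior() that builds prior models of
  // the same cardinality as the data instead of assuming 2-d data.
  public Model<Vector>[] sampleFromPrior(int howMany, int cardinality) {
    Model<Vector>[] result = new NormalModel[howMany];
    for (int i = 0; i < howMany; i++) {
      // zero-mean prior of the right size, stdDev = 1
      result[i] = new NormalModel(new DenseVector(cardinality), 1);
    }
    return result;
  }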

Re: CardinalityException in DirichletDriver

Posted by Ted Dunning <te...@gmail.com>.
Dirichlet has some theoretical advantages (such as deducing how many
clusters are justified and providing non-deterministic answers when
ambiguity is present).  It has no run-time advantage.  It probably is more delicate
with respect to parameter settings.

If you have some time budget, I think you could get some substantial
improvements.

If you are in a hurry, then k-means will work much better.

On Wed, Jan 13, 2010 at 3:26 PM, Bogdan Vatkov <bo...@gmail.com>wrote:

> But if I am the first one to use Dirichlet, which algorithm is the recommended
> one? Are all other algs better than Dirichlet, so no one used it ;)?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: CardinalityException in DirichletDriver

Posted by Benson Margulies <bi...@gmail.com>.
OT: Every time I see this go by, I expect to see 'Cardinality' and 'Richelieu'.

Re: CardinalityException in DirichletDriver

Posted by Olivier Grisel <ol...@ensta.org>.
2010/1/19 Ted Dunning <te...@gmail.com>:
> Look at BinaryRandomizer (which implements TermRandomizer).
>
> On Tue, Jan 19, 2010 at 10:58 AM, Jeff Eastman
> <jd...@windwardsolutions.com>wrote:
>
>> Looking in MAHOUT-228-3.patch, I don't see any sparse vectorizer. Did you
>> have another patch in mind?

If you plan to apply the MAHOUT-228-3 patch you can directly clone my git branch:

  http://github.com/ogrisel/mahout/commits/MAHOUT-228

I have also just added a sample Hadoop driver to deterministically map
unbounded-dimensional documents to some fixed-dimensional space using
the random projection induced by the BinaryRandomizer.
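
To illustrate the general idea only (a hashed, signed random projection; this
is not the actual BinaryRandomizer code, and the class and method names here
are made up):

  import java.util.Map;

  public class HashedProjectionSketch {
    // Deterministically fold a term -> weight map into k output dimensions.
    // The hash of (term, probe) picks both the target dimension and the sign,
    // so a given term always maps the same way regardless of dictionary size.
    public static double[] project(Map<String, Double> termWeights, int k, int probes) {
      double[] out = new double[k];
      for (Map.Entry<String, Double> e : termWeights.entrySet()) {
        for (int p = 0; p < probes; p++) {
          int h = (e.getKey() + "#" + p).hashCode();
          int dim = Math.abs(h % k);
          double sign = ((h >>> 1) & 1) == 0 ? 1.0 : -1.0;
          out[dim] += sign * e.getValue();
        }
      }
      return out;
    }
  }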

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name

Re: CardinalityException in DirichletDriver

Posted by Ted Dunning <te...@gmail.com>.
Look at BinaryRandomizer (which implements TermRandomizer).

On Tue, Jan 19, 2010 at 10:58 AM, Jeff Eastman
<jd...@windwardsolutions.com>wrote:

> Looking in MAHOUT-228-3.patch, I don't see any sparse vectorizer. Did you
> have another patch in mind?




-- 
Ted Dunning, CTO
DeepDyve

Re: CardinalityException in DirichletDriver

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Bogdan,

I coded this up and wrote a unit test which appears to identify the 
correct number and shape of the test models. It does indeed preserve the 
model sparseness, so I committed it.

The Model Distribution uses a uniform, empty prior of the proper 
cardinality. The algorithm does not seem to mind this and converges 
pretty quickly on a stable set of models.

The unit test uses the Lucene utilities to compute TFIDF vectors which 
are input to the DirichletClusterer.

It would be interesting to see if it performs at all well on your more 
extensive data. Feel free to suggest improvements.

Jeff


Jeff Eastman wrote:
> Hi Ted,
>
> Ok, from this and looking at your code here is what I get:
>
> L1Model has a single, sparse coefficient vector M[t] where each 
> coefficient is the probability of that term being present in the 
> model. As (TF-IDF?) data values X[t] are scanned the pdf(X) for each 
> model would be exp(- ManhattanDistanceMeasure(M, X)). The list of pdfs 
> times the mixture probabilities is then sampled as a multinomial which 
> selects a particular model from the list of available models. When the 
> model then observes(X[t]), M=M+X and a count of observed values is 
> incremented. When computeParameters() is called, presumably M is 
> normalized (regularized?) and then sampled somehow to become the 
> posterior model for the next iteration.
>
> L1ModelDistribution needs to compute a list of models from its prior 
> and posterior distributions. What is known about each prior model? 
> M[t] should have some non-zero coefficients but we don't know which 
> ones? Seems like we could pick a few at random. Even if they are all 
> identical with empty Ms, the multinomial will still force the data 
> values into different models and, after the iteration is over, the 
> models will all be different and will diverge from each other as they 
> (hopefully) converge upon a description of the corpus. That's a little 
> like what kMeans does with random initial clusters and how Dirichlet 
> works with NormalModelDistributions (all prior models are identical 
> with zero mean coefficients).
>
> This has a lot of question marks in it but I'm pressing send anyhow,
> Jeff
>
>
> Ted Dunning wrote:
>> On Tue, Jan 19, 2010 at 10:58 AM, Jeff Eastman
>> <jd...@windwardsolutions.com>wrote:
>>
>>  
>>> Looking in MAHOUT-228-3.patch, I don't see any sparse vectorizer. 
>>> Did you
>>> have another patch in mind?
>>>
>>>     
>>
>> There should have been one.  Let me check to figure out the name.
>>
>>
>>  
>>> I'm trying to wrap my mind around "L-1 model distribution".
>>>     
>>
>>
>> For the classifier learning, what we have is a prior distribution for
>> classifiers that has probability proportional to exp(- 
>> sum(abs(w_i))).  The
>> log of this probability is - sum(abs(w_i)) = L_1(w) which gives the 
>> name.
>> This log probability is what is used as a regularization term in the
>> optimization of the classifier.
>>
>> It isn't obvious from this definition, but this prior/regularizer has 
>> the
>> effect of preferring sparse models (for classification).  Where L_2 
>> priors
>> prefer lots of small weights in ambiguous conditions because the 
>> penalty on
>> large coefficients is so large, L_1 priors prefer to focus the weight 
>> on one
>> or a few larger coefficients.
>>
>>
>>  
>>> .... Would an L-1 model vector only have integer-valued elements?
>>>
>>>     
>>
>> In the sense that 0 is an integer, yes.  :-)
>>
>> But what it prefers is zero valued coefficients.
>>
>>   
>
>


Re: CardinalityException in DirichletDriver

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Ted,

Ok, from this and looking at your code here is what I get:

L1Model has a single, sparse coefficient vector M[t] where each 
coefficient is the probability of that term being present in the model. 
As (TF-IDF?) data values X[t] are scanned the pdf(X) for each model 
would be exp(- ManhattanDistanceMeasure(M, X)). The list of pdfs times 
the mixture probabilities is then sampled as a multinomial which selects 
a particular model from the list of available models. When the model 
then observes(X[t]), M=M+X and a count of observed values is 
incremented. When computeParameters() is called, presumably M is 
normalized (regularized?) and then sampled somehow to become the 
posterior model for the next iteration.
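
As a rough sketch of that description, using plain arrays rather than Mahout
Vectors (illustrative only, not the committed L1Model code):

  // Sketch of the L1Model described above.
  class L1ModelSketch {
    double[] m;       // coefficient vector M[t]
    int observations; // number of points assigned to this model

    L1ModelSketch(int cardinality) { m = new double[cardinality]; }

    // pdf(x) = exp(-ManhattanDistance(m, x))
    double pdf(double[] x) {
      double l1 = 0.0;
      for (int i = 0; i < m.length; i++) {
        l1 += Math.abs(m[i] - x[i]);
      }
      return Math.exp(-l1);
    }

    // observe(x): M = M + X, and count the observation for computeParameters()
    void observe(double[] x) {
      for (int i = 0; i < m.length; i++) {
        m[i] += x[i];
      }
      observations++;
    }
  }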

L1ModelDistribution needs to compute a list of models from its prior and 
posterior distributions. What is known about each prior model? M[t] 
should have some non-zero coefficients but we don't know which ones? 
Seems like we could pick a few at random. Even if they are all identical 
with empty Ms, the multinomial will still force the data values into 
different models and, after the iteration is over, the models will all 
be different and will diverge from each other as they (hopefully) 
converge upon a description of the corpus. That's a little like what 
kMeans does with random initial clusters and how Dirichlet works with 
NormalModelDistributions (all prior models are identical with zero mean 
coefficients).

This has a lot of question marks in it but I'm pressing send anyhow,
Jeff


Ted Dunning wrote:
> On Tue, Jan 19, 2010 at 10:58 AM, Jeff Eastman
> <jd...@windwardsolutions.com>wrote:
>
>   
>> Looking in MAHOUT-228-3.patch, I don't see any sparse vectorizer. Did you
>> have another patch in mind?
>>
>>     
>
> There should have been one.  Let me check to figure out the name.
>
>
>   
>> I'm trying to wrap my mind around "L-1 model distribution".
>>     
>
>
> For the classifier learning, what we have is a prior distribution for
> classifiers that has probability proportional to exp(- sum(abs(w_i))).  The
> log of this probability is - sum(abs(w_i)) = L_1(w) which gives the name.
> This log probability is what is used as a regularization term in the
> optimization of the classifier.
>
> It isn't obvious from this definition, but this prior/regularizer has the
> effect of preferring sparse models (for classification).  Where L_2 priors
> prefer lots of small weights in ambiguous conditions because the penalty on
> large coefficients is so large, L_1 priors prefer to focus the weight on one
> or a few larger coefficients.
>
>
>   
>> .... Would an L-1 model vector only have integer-valued elements?
>>
>>     
>
> In the sense that 0 is an integer, yes.  :-)
>
> But what it prefers is zero valued coefficients.
>
>   


Re: CardinalityException in DirichletDriver

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Jan 19, 2010 at 10:58 AM, Jeff Eastman
<jd...@windwardsolutions.com>wrote:

>
> Looking in MAHOUT-228-3.patch, I don't see any sparse vectorizer. Did you
> have another patch in mind?
>

There should have been one.  Let me check to figure out the name.


> I'm trying to wrap my mind around "L-1 model distribution".


For the classifier learning, what we have is a prior distribution for
classifiers that has probability proportional to exp(- sum(abs(w_i))).  The
log of this probability is - sum(abs(w_i)) = L_1(w) which gives the name.
This log probability is what is used as a regularization term in the
optimization of the classifier.

It isn't obvious from this definition, but this prior/regularizer has the
effect of preferring sparse models (for classification).  Where L_2 priors
prefer lots of small weights in ambiguous conditions because the penalty on
large coefficients is so large, L_1 priors prefer to focus the weight on one
or a few larger coefficients.
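
Spelled out as formulas (same content as the prose above):

  p(w) \propto \exp\Big(-\sum_i |w_i|\Big), \qquad
  \log p(w) = -\sum_i |w_i| + \mathrm{const} = -L_1(w) + \mathrm{const}

so maximizing the log posterior amounts to minimizing the training loss plus
an L_1(w) penalty.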


> .... Would an L-1 model vector only have integer-valued elements?
>

In the sense that 0 is an integer, yes.  :-)

But what it prefers is zero valued coefficients.

Re: CardinalityException in DirichletDriver

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Ted,

Looking in MAHOUT-228-3.patch, I don't see any sparse vectorizer. Did 
you have another patch in mind?

I'm trying to wrap my mind around "L-1 model distribution". I recall the 
earlier discussions of L-n norms on list related to our distance 
measures but cannot connect the dots. Would an L-1 model vector only 
have integer-valued elements?

Jeff

Ted Dunning wrote:
> Highjacking the sparse vectorizer from the SGD patch might help with this.
> Likewise, using an L-1 model distribution would enforce sparseness by nature
> (I think).  Sampling from the L-1 prior might be a bit of a trip.
>
> On Mon, Jan 18, 2010 at 4:27 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:
>
>   
>> I think you will need to bound your model dimensionality to use Dirichlet.
>> If you are using TF-IDF vectors to represent your documents I would think
>> these would all have the same maximum cardinality which you could specify
>> for the modelPrototype size. I just committed a new model distribution
>> (SparseNormalModelDistribution) that includes a heuristic
>> sampleFromPosterior() to remove small mean element values to preserve model
>> sparseness. It's probably bogus but a place to begin.
>>
>> I have also written one new unit test that runs in memory over a small,
>> 50-d sparse model and 100, 50-d sparse vectors. It does not explode.
>>
>> Just do another update before you begin to pick up those changes.
>>
>>
>> Bogdan Vatkov wrote:
>>
>>     
>>> Well, dimensions - I am just using slightly modified version of
>>> LuceneDriver
>>> (added stopword removal and regex removal of incoming terms), so I guess
>>> it
>>> is just a list of unidimentional vectors of random length.
>>> I will try to run the new code tomorrow.
>>>
>>> On Mon, Jan 18, 2010 at 10:18 PM, Jeff Eastman
>>> <jd...@windwardsolutions.com>wrote:
>>>
>>>
>>>
>>>       
>>>> Yes, they're all in trunk. Just do an svn update and mvn install to get
>>>> them.
>>>>
>>>> BTW, what's the dimensionality of your data?
>>>>
>>>> Jeff
>>>>
>>>>
>>>>
>>>> Bogdan Vatkov wrote:
>>>>
>>>>
>>>>
>>>>         
>>>>> Hi Jeff,
>>>>>
>>>>> I will try with the NormalModelDistribution but I am wondering how to
>>>>> obtain
>>>>> "MAHOUT-251", is this a tag in the SVN or how it is? how can I get the
>>>>> source containing the changes, do I simply sync from trunk? I suppose I
>>>>> have
>>>>> to run mvn install after that, right?
>>>>>
>>>>> Best regards,
>>>>> Bogdan
>>>>>
>>>>> On Mon, Jan 18, 2010 at 9:53 PM, Jeff Eastman <
>>>>> jdog@windwardsolutions.com
>>>>>
>>>>>
>>>>>           
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>             
>>>>>
>>>>>           
>>>>>> Bogdan,
>>>>>>
>>>>>> Recent resolution of MAHOUT-251 should allow you to experiment with
>>>>>> Dirichlet clustering for text models with arbitrary dimensionality. I
>>>>>> suggest starting with the NormalModelDistribution with a large sparse
>>>>>> vector
>>>>>> as its prototype.  The other model distributions create sampled values
>>>>>> for
>>>>>> all the prior model dimensions, negating any value of using sparse
>>>>>> vectors
>>>>>> for their prototypes.
>>>>>>
>>>>>> It may in fact be necessary to introduce a new ModelDistribution and
>>>>>> Model
>>>>>> so that sparse model elements will not fill up with insignificant
>>>>>> values.
>>>>>> After the first iteration computes the new posterior model parameters
>>>>>> from
>>>>>> the observations, many of these values will likely be small so some
>>>>>> heuristic would be needed to preserve model sparseness by removing them
>>>>>> altogether. If all these values are retained, it is probably better to
>>>>>> use a
>>>>>> dense vector representation. A 50k-dimensional model will be a real
>>>>>> compute
>>>>>> hog if it is not kept sparse somehow. Maybe sampleFromPosterior() or
>>>>>> sample() would be good places to embed this heuristic.
>>>>>>
>>>>>> I'll begin writing some tests to experiment with these models.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>
>>>>>
>>>>>           
>>>>         
>>>
>>>
>>>       
>>     
>
>
>   


Re: CardinalityException in DirichletDriver

Posted by Ted Dunning <te...@gmail.com>.
Highjacking the sparse vectorizer from the SGD patch might help with this.
Likewise, using an L-1 model distribution would enforce sparseness by nature
(I think).  Sampling from the L-1 prior might be a bit of a trip.

On Mon, Jan 18, 2010 at 4:27 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> I think you will need to bound your model dimensionality to use Dirichlet.
> If you are using TF-IDF vectors to represent your documents I would think
> these would all have the same maximum cardinality which you could specify
> for the modelPrototype size. I just committed a new model distribution
> (SparseNormalModelDistribution) that includes a heuristic
> sampleFromPosterior() to remove small mean element values to preserve model
> sparseness. It's probably bogus but a place to begin.
>
> I have also written one new unit test that runs in memory over a small,
> 50-d sparse model and 100, 50-d sparse vectors. It does not explode.
>
> Just do another update before you begin to pick up those changes.
>
>
> Bogdan Vatkov wrote:
>
>> Well, dimensions - I am just using slightly modified version of
>> LuceneDriver
>> (added stopword removal and regex removal of incoming terms), so I guess
>> it
>> is just a list of unidimentional vectors of random length.
>> I will try to run the new code tomorrow.
>>
>> On Mon, Jan 18, 2010 at 10:18 PM, Jeff Eastman
>> <jd...@windwardsolutions.com>wrote:
>>
>>
>>
>>> Yes, they're all in trunk. Just do an svn update and mvn install to get
>>> them.
>>>
>>> BTW, what's the dimensionality of your data?
>>>
>>> Jeff
>>>
>>>
>>>
>>> Bogdan Vatkov wrote:
>>>
>>>
>>>
>>>> Hi Jeff,
>>>>
>>>> I will try with the NormalModelDistribution but I am wondering how to
>>>> obtain
>>>> "MAHOUT-251", is this a tag in the SVN or how it is? how can I get the
>>>> source containing the changes, do I simply sync from trunk? I suppose I
>>>> have
>>>> to run mvn install after that, right?
>>>>
>>>> Best regards,
>>>> Bogdan
>>>>
>>>> On Mon, Jan 18, 2010 at 9:53 PM, Jeff Eastman <
>>>> jdog@windwardsolutions.com
>>>>
>>>>
>>>>> wrote:
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>> Bogdan,
>>>>>
>>>>> Recent resolution of MAHOUT-251 should allow you to experiment with
>>>>> Dirichlet clustering for text models with arbitrary dimensionality. I
>>>>> suggest starting with the NormalModelDistribution with a large sparse
>>>>> vector
>>>>> as its prototype.  The other model distributions create sampled values
>>>>> for
>>>>> all the prior model dimensions, negating any value of using sparse
>>>>> vectors
>>>>> for their prototypes.
>>>>>
>>>>> It may in fact be necessary to introduce a new ModelDistribution and
>>>>> Model
>>>>> so that sparse model elements will not fill up with insignificant
>>>>> values.
>>>>> After the first iteration computes the new posterior model parameters
>>>>> from
>>>>> the observations, many of these values will likely be small so some
>>>>> heuristic would be needed to preserve model sparseness by removing them
>>>>> altogether. If all these values are retained, it is probably better to
>>>>> use a
>>>>> dense vector representation. A 50k-dimensional model will be a real
>>>>> compute
>>>>> hog if it is not kept sparse somehow. Maybe sampleFromPosterior() or
>>>>> sample() would be good places to embed this heuristic.
>>>>>
>>>>> I'll begin writing some tests to experiment with these models.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>
>


-- 
Ted Dunning, CTO
DeepDyve

Re: CardinalityException in DirichletDriver

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I think you will need to bound your model dimensionality to use 
Dirichlet. If you are using TF-IDF vectors to represent your documents I 
would think these would all have the same maximum cardinality which you 
could specify for the modelPrototype size. I just committed a new model 
distribution (SparseNormalModelDistribution) that includes a heuristic 
sampleFromPosterior() to remove small mean element values to preserve 
model sparseness. It's probably bogus but a place to begin.
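
The heuristic is roughly: after updating the posterior mean, drop elements
whose magnitude falls below some small threshold so the model stays sparse.
A sketch of that idea (illustrative only; the threshold and method name are
made up, not the committed code):

  // Prune near-zero mean elements after a posterior update so a sparse
  // model does not silt up with tiny values.
  static final double EPSILON = 1.0e-4; // arbitrary illustrative threshold

  static void pruneSmallElements(double[] mean) {
    for (int i = 0; i < mean.length; i++) {
      if (Math.abs(mean[i]) < EPSILON) {
        mean[i] = 0.0; // in a sparse vector this frees the entry entirely
      }
    }
  }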

I have also written one new unit test that runs in memory over a small, 
50-d sparse model and 100, 50-d sparse vectors. It does not explode.

Just do another update before you begin to pick up those changes.


Bogdan Vatkov wrote:
> Well, dimensions - I am just using a slightly modified version of LuceneDriver
> (added stopword removal and regex removal of incoming terms), so I guess it
> is just a list of unidimensional vectors of random length.
> I will try to run the new code tomorrow.
>
> On Mon, Jan 18, 2010 at 10:18 PM, Jeff Eastman
> <jd...@windwardsolutions.com>wrote:
>
>   
>> Yes, they're all in trunk. Just do an svn update and mvn install to get
>> them.
>>
>> BTW, what's the dimensionality of your data?
>>
>> Jeff
>>
>>
>>
>> Bogdan Vatkov wrote:
>>
>>     
>>> Hi Jeff,
>>>
>>> I will try with the NormalModelDistribution but I am wondering how to
>>> obtain
>>> "MAHOUT-251", is this a tag in the SVN or how it is? how can I get the
>>> source containing the changes, do I simply sync from trunk? I suppose I
>>> have
>>> to run mvn install after that, right?
>>>
>>> Best regards,
>>> Bogdan
>>>
>>> On Mon, Jan 18, 2010 at 9:53 PM, Jeff Eastman <jdog@windwardsolutions.com
>>>       
>>>> wrote:
>>>>         
>>>
>>>       
>>>> Bogdan,
>>>>
>>>> Recent resolution of MAHOUT-251 should allow you to experiment with
>>>> Dirichlet clustering for text models with arbitrary dimensionality. I
>>>> suggest starting with the NormalModelDistribution with a large sparse
>>>> vector
>>>> as its prototype.  The other model distributions create sampled values
>>>> for
>>>> all the prior model dimensions, negating any value of using sparse
>>>> vectors
>>>> for their prototypes.
>>>>
>>>> It may in fact be necessary to introduce a new ModelDistribution and
>>>> Model
>>>> so that sparse model elements will not fill up with insignificant values.
>>>> After the first iteration computes the new posterior model parameters
>>>> from
>>>> the observations, many of these values will likely be small so some
>>>> heuristic would be needed to preserve model sparseness by removing them
>>>> altogether. If all these values are retained, it is probably better to
>>>> use a
>>>> dense vector representation. A 50k-dimensional model will be a real
>>>> compute
>>>> hog if it is not kept sparse somehow. Maybe sampleFromPosterior() or
>>>> sample() would be good places to embed this heuristic.
>>>>
>>>> I'll begin writing some tests to experiment with these models.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>         
>>>
>>>
>>>       
>>     
>
>
>   


Re: CardinalityException in DirichletDriver

Posted by Bogdan Vatkov <bo...@gmail.com>.
Well, dimensions - I am just using a slightly modified version of LuceneDriver
(added stopword removal and regex removal of incoming terms), so I guess it
is just a list of unidimensional vectors of random length.
I will try to run the new code tomorrow.

On Mon, Jan 18, 2010 at 10:18 PM, Jeff Eastman
<jd...@windwardsolutions.com>wrote:

> Yes, they're all in trunk. Just do an svn update and mvn install to get
> them.
>
> BTW, what's the dimensionality of your data?
>
> Jeff
>
>
>
> Bogdan Vatkov wrote:
>
>> Hi Jeff,
>>
>> I will try with the NormalModelDistribution but I am wondering how to
>> obtain
>> "MAHOUT-251", is this a tag in the SVN or how it is? how can I get the
>> source containing the changes, do I simply sync from trunk? I suppose I
>> have
>> to run mvn install after that, right?
>>
>> Best regards,
>> Bogdan
>>
>> On Mon, Jan 18, 2010 at 9:53 PM, Jeff Eastman <jdog@windwardsolutions.com
>> >wrote:
>>
>>
>>
>>> Bogdan,
>>>
>>> Recent resolution of MAHOUT-251 should allow you to experiment with
>>> Dirichlet clustering for text models with arbitrary dimensionality. I
>>> suggest starting with the NormalModelDistribution with a large sparse
>>> vector
>>> as its prototype.  The other model distributions create sampled values
>>> for
>>> all the prior model dimensions, negating any value of using sparse
>>> vectors
>>> for their prototypes.
>>>
>>> It may in fact be necessary to introduce a new ModelDistribution and
>>> Model
>>> so that sparse model elements will not fill up with insignificant values.
>>> After the first iteration computes the new posterior model parameters
>>> from
>>> the observations, many of these values will likely be small so some
>>> heuristic would be needed to preserve model sparseness by removing them
>>> altogether. If all these values are retained, it is probably better to
>>> use a
>>> dense vector representation. A 50k-dimensional model will be a real
>>> compute
>>> hog if it is not kept sparse somehow. Maybe sampleFromPosterior() or
>>> sample() would be good places to embed this heuristic.
>>>
>>> I'll begin writing some tests to experiment with these models.
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>


-- 
Best regards,
Bogdan

Re: CardinalityException in DirichletDriver

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Yes, they're all in trunk. Just do an svn update and mvn install to get 
them.

BTW, what's the dimensionality of your data?

Jeff


Bogdan Vatkov wrote:
> Hi Jeff,
>
> I will try with the NormalModelDistribution but I am wondering how to obtain
> "MAHOUT-251", is this a tag in the SVN or how it is? how can I get the
> source containing the changes, do I simply sync from trunk? I suppose I have
> to run mvn install after that, right?
>
> Best regards,
> Bogdan
>
> On Mon, Jan 18, 2010 at 9:53 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:
>
>   
>> Bogdan,
>>
>> Recent resolution of MAHOUT-251 should allow you to experiment with
>> Dirichlet clustering for text models with arbitrary dimensionality. I
>> suggest starting with the NormalModelDistribution with a large sparse vector
>> as its prototype.  The other model distributions create sampled values for
>> all the prior model dimensions, negating any value of using sparse vectors
>> for their prototypes.
>>
>> It may in fact be necessary to introduce a new ModelDistribution and Model
>> so that sparse model elements will not fill up with insignificant values.
>> After the first iteration computes the new posterior model parameters from
>> the observations, many of these values will likely be small so some
>> heuristic would be needed to preserve model sparseness by removing them
>> altogether. If all these values are retained, it is probably better to use a
>> dense vector representation. A 50k-dimensional model will be a real compute
>> hog if it is not kept sparse somehow. Maybe sampleFromPosterior() or
>> sample() would be good places to embed this heuristic.
>>
>> I'll begin writing some tests to experiment with these models.
>>
>>
>>
>>     
>
>
>   


Re: CardinalityException in DirichletDriver

Posted by Bogdan Vatkov <bo...@gmail.com>.
Hi Jeff,

I will try with the NormalModelDistribution, but I am wondering how to obtain
"MAHOUT-251" - is this a tag in the SVN, or what is it? How can I get the
source containing the changes - do I simply sync from trunk? I suppose I have
to run mvn install after that, right?

Best regards,
Bogdan

On Mon, Jan 18, 2010 at 9:53 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> Bogdan,
>
> Recent resolution of MAHOUT-251 should allow you to experiment with
> Dirichlet clustering for text models with arbitrary dimensionality. I
> suggest starting with the NormalModelDistribution with a large sparse vector
> as its prototype.  The other model distributions create sampled values for
> all the prior model dimensions, negating any value of using sparse vectors
> for their prototypes.
>
> It may in fact be necessary to introduce a new ModelDistribution and Model
> so that sparse model elements will not fill up with insignificant values.
> After the first iteration computes the new posterior model parameters from
> the observations, many of these values will likely be small so some
> heuristic would be needed to preserve model sparseness by removing them
> altogether. If all these values are retained, it is probably better to use a
> dense vector representation. A 50k-dimensional model will be a real compute
> hog if it is not kept sparse somehow. Maybe sampleFromPosterior() or
> sample() would be good places to embed this heuristic.
>
> I'll begin writing some tests to experiment with these models.
>
>
>


-- 
Best regards,
Bogdan

Re: CardinalityException in DirichletDriver

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Bogdan,

Recent resolution of MAHOUT-251 should allow you to experiment with 
Dirichlet clustering for text models with arbitrary dimensionality. I 
suggest starting with the NormalModelDistribution with a large sparse 
vector as its prototype.  The other model distributions create sampled 
values for all the prior model dimensions, negating any value of using 
sparse vectors for their prototypes.

It may in fact be necessary to introduce a new ModelDistribution and 
Model so that sparse model elements will not fill up with insignificant 
values. After the first iteration computes the new posterior model 
parameters from the observations, many of these values will likely be 
small so some heuristic would be needed to preserve model sparseness by 
removing them altogether. If all these values are retained, it is 
probably better to use a dense vector representation. A 50k-dimensional 
model will be a real compute hog if it is not kept sparse somehow. Maybe 
sampleFromPosterior() or sample() would be good places to embed this 
heuristic.

I'll begin writing some tests to experiment with these models.



Re: CardinalityException in DirichletDriver

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Ah, ok, perhaps I will start with something similar and see how far I 
can get with Dirichlet.

Bogdan Vatkov wrote:
> unfortunately I am using private data which I cannot share. I am using
> emails, indexed by Solr and then creating vectors out of them. I am using
> them with k-means and everything is ok. Just wanted to try out the Dirichlet
> algorithm.
>
> On Thu, Jan 14, 2010 at 8:49 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:
>
>   
>> I gather you are doing text clustering? Are you using one of our example
>> datasets or one which is publicly available?
>>
>>
>>
>> Bogdan Vatkov wrote:
>>
>>     
>>> Hi Jeff,
>>>
>>> What kind of details do you need to continue?
>>> In the mean time I am anyway going back to kmeans (maybe I really start
>>> with
>>> adding canopy to my kmeans only scenario first ;)).
>>>
>>> Best regards,
>>> Bogdan
>>>
>>> On Thu, Jan 14, 2010 at 1:49 AM, Jeff Eastman <jdog@windwardsolutions.com
>>>       
>>>> wrote:
>>>>         
>>>
>>>       
>>>> I think KMeans and Canopy are the most-used and therefore the most
>>>> robust.
>>>> Dirichlet still has not seen much use beyond some test examples and
>>>> NormalModel has at least one known problem (with sample() only returning
>>>> the
>>>> maximum likelihood) that has been reported but never fixed. Can you point
>>>> me
>>>> to the problem you are running so I can try to get up to speed? It has
>>>> been
>>>> some time since I worked in this code but I'm keen to do so and I have
>>>> some
>>>> time to invest.
>>>>
>>>> Jeff
>>>>
>>>>
>>>>
>>>> Bogdan Vatkov wrote:
>>>>
>>>>
>>>>
>>>>         
>>>>> But I am the first one to use Dirichlet which algorithm is the
>>>>> recommended
>>>>> one? Are all other algs better than Dirichlet so no one used it ;)?
>>>>>
>>>>> On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <
>>>>> jdog@windwardsolutions.com
>>>>>
>>>>>
>>>>>           
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>             
>>>>>
>>>>>           
>>>>>> The NormalModelDistribution seems to still think all the data vectors
>>>>>> are
>>>>>> size=2.  In SampleFromPrior, it is creating models with that size.
>>>>>> Subsequently, when you calculate the pdf with your data value (x) the
>>>>>> sizes
>>>>>> are incompatible. Suggest changing 'DenseVector(2)' to
>>>>>> 'DenseVector(n)',
>>>>>> where n is your data cardinality. Please also look at the rest of the
>>>>>> math
>>>>>> in DenseVector with suspicion. AFAIK, you are the first person to try
>>>>>> to
>>>>>> use Dirichlet.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Bogdan Vatkov wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> I see a stack trace when the size of the vector mean is set to 2:
>>>>>>>
>>>>>>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in
>>>>>>> NormalModel))
>>>>>>> NormalModel.<init>(Vector, double) line: 48
>>>>>>> NormalModelDistribution.sampleFromPrior(int) line: 33
>>>>>>> DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int)
>>>>>>> line:
>>>>>>> 48
>>>>>>> DirichletDriver.createState(String, int, double) line: 172
>>>>>>> DirichletDriver.writeInitialState(String, String, String, int, double)
>>>>>>> line:
>>>>>>> 150
>>>>>>> DirichletDriver.runJob(String, String, String, int, int, double, int)
>>>>>>> line:
>>>>>>> 133
>>>>>>> DirichletDriver.main(String[]) line: 109
>>>>>>> Clusters.doClustering() line: 244
>>>>>>> Clusters.access$0(Clusters) line: 175
>>>>>>> Clusters$1.run() line: 148
>>>>>>> Thread.run() line: 619
>>>>>>>
>>>>>>>
>>>>>>> public class NormalModelDistribution implements
>>>>>>> ModelDistribution<Vector>
>>>>>>> {
>>>>>>> @Override public Model<Vector>[] sampleFromPrior(int howMany) {
>>>>>>> Model<Vector>[] result = new NormalModel[howMany]; for (int i = 0; i <
>>>>>>> howMany; i++) { result[i] = new NormalModel(new DenseVector(2), 1); }
>>>>>>> return
>>>>>>> result; }
>>>>>>>
>>>>>>> and later this vector is dotted to
>>>>>>>  @Override
>>>>>>>  public double pdf(Vector x) {
>>>>>>>  double sd2 = stdDev * stdDev;
>>>>>>>  double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 *
>>>>>>> sd2);
>>>>>>>  double ex = Math.exp(exp);
>>>>>>>  return ex / (stdDev * sqrt2pi);
>>>>>>>  }
>>>>>>>
>>>>>>> x vector which is coming from Hadoop MapRunner through the map
>>>>>>> function:
>>>>>>>
>>>>>>>  public void map(WritableComparable<?> key, Vector v,
>>>>>>>                OutputCollector<Text, Vector> output, Reporter
>>>>>>> reporter)
>>>>>>> throws IOException {
>>>>>>>
>>>>>>>
>>>>>>> any idea?
>>>>>>>
>>>>>>> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? is it
>>>>>>> safe
>>>>>>> enough to run against trunk?
>>>>>>>
>>>>>>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <te...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <
>>>>>>>> bogdan.vatkov@gmail.com
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> wrote:
>>>>>>>>>    Sorry, what does that mean :)?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>> It means that there is probably a programming bug somehow.  At the
>>>>>>>> very
>>>>>>>> least, the program is not robust with respect to strange invocations.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> what is a dotted vector? and why aren't they the same?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>> dot product is a vector operation that is the sum of products of
>>>>>>>> corresponding elements of the two vectors being operated on.  If
>>>>>>>> these
>>>>>>>> vectors don't have the same length, then it is an error.
>>>>>>>>
>>>>>>>> what should I investigate?
>>>>>>>>  I am not familiar with the code, but if I had time to look, my
>>>>>>>> strategy
>>>>>>>> would be to start in the NormalModel and work back up the stack trace
>>>>>>>> to
>>>>>>>> find out how the vectors came to be different lengths.  No doubt, the
>>>>>>>> code
>>>>>>>> in NormalModel will not tell you anything, but you can see which
>>>>>>>> vectors
>>>>>>>> are
>>>>>>>> involved and by walking up the stack you may be able to see where
>>>>>>>> they
>>>>>>>> come
>>>>>>>> from.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> I am basically running my complete kmeans scenario (same input data,
>>>>>>>>> same
>>>>>>>>> number of clusters param, etc.) but just replacing KmeansDriver.main
>>>>>>>>> step
>>>>>>>>> with a DirichletDriver.main call...of course the arguments are
>>>>>>>>> adjusted
>>>>>>>>> since kmeans and dirichlet do not have the same arguments.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>> I would think that this sounds very plausible.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> I am not sure what number I should give for the alpha argument,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>> Alpha should have a value in the range from 0.01 to 20.  I would scan
>>>>>>>> with
>>>>>>>> 1,2, 5 magnitude steps to see what works well for your data.  (i.e.
>>>>>>>> 0.01,
>>>>>>>> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.
>>>>>>>>  The
>>>>>>>> effect of different values should be small over a pretty wide range.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> iterations
>>>>>>>>> and reductions...here is my current argument set:
>>>>>>>>>
>>>>>>>>> args = new String[] {
>>>>>>>>> "--input",
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> "--output", config.getClustersDir(),
>>>>>>>>> "--modelClass",
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>>>>>> "--maxIter", "15",
>>>>>>>>> "--alpha", "1.0",
>>>>>>>>> "--k", config.getClustersCount(),
>>>>>>>>> "--maxRed", "2"
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>> Not off-hand.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>             
>>>>>
>>>>>
>>>>>           
>>>>         
>>>
>>>
>>>       
>>     
>
>
>   


Re: CardinalityException in DirichletDriver

Posted by Bogdan Vatkov <bo...@gmail.com>.
unfortunately I am using private data which I cannot share. I am using
emails, indexed by Solr and then creating vectors out of them. I am using
them with k-means and everything is ok. Just wanted to try out the Dirichlet
algorithm.

On Thu, Jan 14, 2010 at 8:49 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> I gather you are doing text clustering? Are you using one of our example
> datasets or one which is publicly available?
>
>
>
> Bogdan Vatkov wrote:
>
>> Hi Jeff,
>>
>> What kind of details do you need to continue?
>> In the mean time I am anyway going back to kmeans (maybe I really start
>> with
>> adding canopy to my kmeans only scenario first ;)).
>>
>> Best regards,
>> Bogdan
>>
>> On Thu, Jan 14, 2010 at 1:49 AM, Jeff Eastman <jdog@windwardsolutions.com
>> >wrote:
>>
>>
>>
>>> I think KMeans and Canopy are the most-used and therefore the most
>>> robust.
>>> Dirichlet still has not seen much use beyond some test examples and
>>> NormalModel has at least one known problem (with sample() only returning
>>> the
>>> maximum likelihood) that has been reported but never fixed. Can you point
>>> me
>>> to the problem you are running so I can try to get up to speed? It has
>>> been
>>> some time since I worked in this code but I'm keen to do so and I have
>>> some
>>> time to invest.
>>>
>>> Jeff
>>>
>>>
>>>
>>> Bogdan Vatkov wrote:
>>>
>>>
>>>
>>>> But I am the first one to use Dirichlet which algorithm is the
>>>> recommended
>>>> one? Are all other algs better than Dirichlet so no one used it ;)?
>>>>
>>>> On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <
>>>> jdog@windwardsolutions.com
>>>>
>>>>
>>>>> wrote:
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>> The NormalModelDistribution seems to still think all the data vectors
>>>>> are
>>>>> size=2.  In SampleFromPrior, it is creating models with that size.
>>>>> Subsequently, when you calculate the pdf with your data value (x) the
>>>>> sizes
>>>>> are incompatible. Suggest changing 'DenseVector(2)' to
>>>>> 'DenseVector(n)',
>>>>> where n is your data cardinality. Please also look at the rest of the
>>>>> math
>>>>> in DenseVector with suspicion. AFAIK, you are the first person to try
>>>>> to
>>>>> use Dirichlet.
>>>>>
>>>>>
>>>>>
>>>>> Bogdan Vatkov wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I see a stack trace when the size of the vector mean is set to 2:
>>>>>>
>>>>>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in
>>>>>> NormalModel))
>>>>>> NormalModel.<init>(Vector, double) line: 48
>>>>>> NormalModelDistribution.sampleFromPrior(int) line: 33
>>>>>> DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int)
>>>>>> line:
>>>>>> 48
>>>>>> DirichletDriver.createState(String, int, double) line: 172
>>>>>> DirichletDriver.writeInitialState(String, String, String, int, double)
>>>>>> line:
>>>>>> 150
>>>>>> DirichletDriver.runJob(String, String, String, int, int, double, int)
>>>>>> line:
>>>>>> 133
>>>>>> DirichletDriver.main(String[]) line: 109
>>>>>> Clusters.doClustering() line: 244
>>>>>> Clusters.access$0(Clusters) line: 175
>>>>>> Clusters$1.run() line: 148
>>>>>> Thread.run() line: 619
>>>>>>
>>>>>>
>>>>>> public class NormalModelDistribution implements
>>>>>> ModelDistribution<Vector>
>>>>>> {
>>>>>> @Override public Model<Vector>[] sampleFromPrior(int howMany) {
>>>>>> Model<Vector>[] result = new NormalModel[howMany]; for (int i = 0; i <
>>>>>> howMany; i++) { result[i] = new NormalModel(new DenseVector(2), 1); }
>>>>>> return
>>>>>> result; }
>>>>>>
>>>>>> and later this vector is dotted to
>>>>>>  @Override
>>>>>>  public double pdf(Vector x) {
>>>>>>  double sd2 = stdDev * stdDev;
>>>>>>  double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 *
>>>>>> sd2);
>>>>>>  double ex = Math.exp(exp);
>>>>>>  return ex / (stdDev * sqrt2pi);
>>>>>>  }
>>>>>>
>>>>>> x vector which is coming from Hadoop MapRunner through the map
>>>>>> function:
>>>>>>
>>>>>>  public void map(WritableComparable<?> key, Vector v,
>>>>>>                OutputCollector<Text, Vector> output, Reporter
>>>>>> reporter)
>>>>>> throws IOException {
>>>>>>
>>>>>>
>>>>>> any idea?
>>>>>>
>>>>>> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? is it
>>>>>> safe
>>>>>> enough to run against trunk?
>>>>>>
>>>>>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <te...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <
>>>>>>> bogdan.vatkov@gmail.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>    Sorry, what does that mean :)?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> It means that there is probably a programming bug somehow.  At the
>>>>>>> very
>>>>>>> least, the program is not robust with respect to strange invocations.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> what is a dotted vector? and why aren't they the same?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> dot product is a vector operation that is the sum of products of
>>>>>>> corresponding elements of the two vectors being operated on.  If
>>>>>>> these
>>>>>>> vectors don't have the same length, then it is an error.
>>>>>>>
>>>>>>> what should I investigate?
>>>>>>>  I am not familiar with the code, but if I had time to look, my
>>>>>>> strategy
>>>>>>> would be to start in the NormalModel and work back up the stack trace
>>>>>>> to
>>>>>>> find out how the vectors came to be different lengths.  No doubt, the
>>>>>>> code
>>>>>>> in NormalModel will not tell you anything, but you can see which
>>>>>>> vectors
>>>>>>> are
>>>>>>> involved and by walking up the stack you may be able to see where
>>>>>>> they
>>>>>>> come
>>>>>>> from.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> I am basically running my complete kmeans scenario (same input data,
>>>>>>>> same
>>>>>>>> number of clusters param, etc.) but just replacing KmeansDriver.main
>>>>>>>> step
>>>>>>>> with a DirichletDriver.main call...of course the arguments are
>>>>>>>> adjusted
>>>>>>>> since kmeans and dirichlet do not have the same arguments.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> I would think that this sounds very plausible.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> I am not sure what number I should give for the alpha argument,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Alpha should have a value in the range from 0.01 to 20.  I would scan
>>>>>>> with
>>>>>>> 1,2, 5 magnitude steps to see what works well for your data.  (i.e.
>>>>>>> 0.01,
>>>>>>> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.
>>>>>>>  The
>>>>>>> effect of different values should be small over a pretty wide range.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> iterations
>>>>>>>> and reductions...here is my current argument set:
>>>>>>>>
>>>>>>>> args = new String[] {
>>>>>>>> "--input",
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> "--output", config.getClustersDir(),
>>>>>>>> "--modelClass",
>>>>>>>>
>>>>>>>>
>>>>>>>> "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>>>>> "--maxIter", "15",
>>>>>>>> "--alpha", "1.0",
>>>>>>>> "--k", config.getClustersCount(),
>>>>>>>> "--maxRed", "2"
>>>>>>>> };
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Not off-hand.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>
>


-- 
Best regards,
Bogdan

Re: CardinalityException in DirichletDriver

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I gather you are doing text clustering? Are you using one of our example 
datasets or one which is publicly available?


Bogdan Vatkov wrote:
> Hi Jeff,
>
> What kind of details do you need to continue?
> In the mean time I am anyway going back to kmeans (maybe I really start with
> adding canopy to my kmeans only scenario first ;)).
>
> Best regards,
> Bogdan
>
> On Thu, Jan 14, 2010 at 1:49 AM, Jeff Eastman <jd...@windwardsolutions.com>wrote:
>
>   
>> I think KMeans and Canopy are the most-used and therefore the most robust.
>> Dirichlet still has not seen much use beyond some test examples and
>> NormalModel has at least one known problem (with sample() only returning the
>> maximum likelihood) that has been reported but never fixed. Can you point me
>> to the problem you are running so I can try to get up to speed? It has been
>> some time since I worked in this code but I'm keen to do so and I have some
>> time to invest.
>>
>> Jeff
>>
>>
>>
>> Bogdan Vatkov wrote:
>>
>>     
>>> But if I am the first one to use Dirichlet, which algorithm is the recommended
>>> one? Are all other algs better than Dirichlet, so no one used it ;)?
>>>
>>> On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <jdog@windwardsolutions.com
>>>       
>>>> wrote:
>>>>         
>>>
>>>       
>>>> The NormalModelDistribution seems to still think all the data vectors are
>>>> size=2.  In SampleFromPrior, it is creating models with that size.
>>>> Subsequently, when you calculate the pdf with your data value (x) the
>>>> sizes
>>>> are incompatible. Suggest changing 'DenseVector(2)' to 'DenseVector(n)',
>>>> where n is your data cardinality. Please also look at the rest of the
>>>> math
>>>> in DenseVector with suspiscion. AFAIK, you are the first person to try to
>>>> use Dirichlet.
>>>>
>>>>
>>>>
>>>> Bogdan Vatkov wrote:
>>>>
>>>>
>>>>
>>>>         
>>>>> I see a stack  when the size of the vectore mean is set to 2:
>>>>>
>>>>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in
>>>>> NormalModel))
>>>>> NormalModel.<init>(Vector, double) line: 48
>>>>> NormalModelDistribution.sampleFromPrior(int) line: 33
>>>>> DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int)
>>>>> line:
>>>>> 48
>>>>> DirichletDriver.createState(String, int, double) line: 172
>>>>> DirichletDriver.writeInitialState(String, String, String, int, double)
>>>>> line:
>>>>> 150
>>>>> DirichletDriver.runJob(String, String, String, int, int, double, int)
>>>>> line:
>>>>> 133
>>>>> DirichletDriver.main(String[]) line: 109
>>>>> Clusters.doClustering() line: 244
>>>>> Clusters.access$0(Clusters) line: 175
>>>>> Clusters$1.run() line: 148
>>>>> Thread.run() line: 619
>>>>>
>>>>>
>>>>> public class NormalModelDistribution implements
>>>>> ModelDistribution<Vector>
>>>>> {
>>>>> @Override public Model<Vector>[] sampleFromPrior(int howMany) {
>>>>> Model<Vector>[] result = new NormalModel[howMany]; for (int i = 0; i <
>>>>> howMany; i++) { result[i] = new NormalModel(new DenseVector(2), 1); }
>>>>> return
>>>>> result; }
>>>>>
>>>>> and later this vector is dotted to
>>>>>  @Override
>>>>>  public double pdf(Vector x) {
>>>>>   double sd2 = stdDev * stdDev;
>>>>>   double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 *
>>>>> sd2);
>>>>>   double ex = Math.exp(exp);
>>>>>   return ex / (stdDev * sqrt2pi);
>>>>>  }
>>>>>
>>>>> x vector which is coming from Hadoop MapRunner through the map function:
>>>>>
>>>>>  public void map(WritableComparable<?> key, Vector v,
>>>>>                 OutputCollector<Text, Vector> output, Reporter reporter)
>>>>> throws IOException {
>>>>>
>>>>>
>>>>> any idea?
>>>>>
>>>>> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? is it
>>>>> safe
>>>>> enough to run against trunk?
>>>>>
>>>>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <te...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>           
>>>>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <
>>>>>> bogdan.vatkov@gmail.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> wrote:
>>>>>>>     Sorry, what does that mean :)?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> It means that there is probably a programming bug somehow.  At the very
>>>>>> least, the program is not robust with respect to strange invocations.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> what is a dotted vector? and why aren't they the same?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> dot product is a vector operation that is the sum of products of
>>>>>> corresponding elements of the two vectors being operated on.  If these
>>>>>> vectors don't have the same length, then it is an error.
>>>>>>
>>>>>> what should I investigate?
>>>>>>   I am not familiar with the code, but if I had time to look, my
>>>>>> strategy
>>>>>> would be to start in the NormalModel and work back up the stack trace
>>>>>> to
>>>>>> find out how the vectors came to be different lengths.  No doubt, the
>>>>>> code
>>>>>> in NormalModel will not tell you anything, but you can see which
>>>>>> vectors
>>>>>> are
>>>>>> involved and by walking up the stack you may be able to see where they
>>>>>> come
>>>>>> from.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> I am basically running my complete kmeans scenario (same input data,
>>>>>>> same
>>>>>>> number of clusters param, etc.) but just replacing KmeansDriver.main
>>>>>>> step
>>>>>>> with a DirichletDriver.main call...of course the arguments are
>>>>>>> adjusted
>>>>>>> since kmeans and dirichlet do not have the same arguments.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> I would think that this sounds very plausible.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> I am not sure what number I should give for the alpha argument,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> Alpha should have a value in the range from 0.01 to 20.  I would scan
>>>>>> with
>>>>>> 1,2, 5 magnitude steps to see what works well for your data.  (i.e.
>>>>>> 0.01,
>>>>>> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.
>>>>>>  The
>>>>>> effect of different values should be small over a pretty wide range.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> iterations
>>>>>>> and reductions...here is my current argument set:
>>>>>>>
>>>>>>> args = new String[] {
>>>>>>> "--input",
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> "--output", config.getClustersDir(),
>>>>>>> "--modelClass",
>>>>>>>
>>>>>>> "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>>>> "--maxIter", "15",
>>>>>>> "--alpha", "1.0",
>>>>>>> "--k", config.getClustersCount(),
>>>>>>> "--maxRed", "2"
>>>>>>> };
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>> Not off-hand.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>
>>>>>
>>>>>
>>>>>           
>>>>         
>>>
>>>
>>>       
>>     
>
>
>   


Re: CardinalityException in DirichletDriver

Posted by Bogdan Vatkov <bo...@gmail.com>.
Hi Jeff,

What kind of details do you need to continue?
In the meantime I am going back to kmeans anyway (maybe I should really start by
adding Canopy to my kmeans-only scenario first ;)).

Best regards,
Bogdan

On Thu, Jan 14, 2010 at 1:49 AM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> I think KMeans and Canopy are the most-used and therefore the most robust.
> Dirichlet still has not seen much use beyond some test examples and
> NormalModel has at least one known problem (with sample() only returning the
> maximum likelihood) that has been reported but never fixed. Can you point me
> to the problem you are running so I can try to get up to speed? It has been
> some time since I worked in this code but I'm keen to do so and I have some
> time to invest.
>
> Jeff
>
>
>
> Bogdan Vatkov wrote:
>
>> But I am the first one to use Dirichlet which algorithm is the recommended
>> one? Are all other algs better then Dirichlet so no one used it ;)?
>>
>> On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <jdog@windwardsolutions.com
>> >wrote:
>>
>>
>>
>>> The NormalModelDistribution seems to still think all the data vectors are
>>> size=2.  In SampleFromPrior, it is creating models with that size.
>>> Subsequently, when you calculate the pdf with your data value (x) the
>>> sizes
>>> are incompatible. Suggest changing 'DenseVector(2)' to 'DenseVector(n)',
>>> where n is your data cardinality. Please also look at the rest of the
>>> math
>>> in DenseVector with suspiscion. AFAIK, you are the first person to try to
>>> use Dirichlet.
>>>
>>>
>>>
>>> Bogdan Vatkov wrote:
>>>
>>>
>>>
>>>> I see a stack  when the size of the vectore mean is set to 2:
>>>>
>>>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in
>>>> NormalModel))
>>>> NormalModel.<init>(Vector, double) line: 48
>>>> NormalModelDistribution.sampleFromPrior(int) line: 33
>>>> DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int)
>>>> line:
>>>> 48
>>>> DirichletDriver.createState(String, int, double) line: 172
>>>> DirichletDriver.writeInitialState(String, String, String, int, double)
>>>> line:
>>>> 150
>>>> DirichletDriver.runJob(String, String, String, int, int, double, int)
>>>> line:
>>>> 133
>>>> DirichletDriver.main(String[]) line: 109
>>>> Clusters.doClustering() line: 244
>>>> Clusters.access$0(Clusters) line: 175
>>>> Clusters$1.run() line: 148
>>>> Thread.run() line: 619
>>>>
>>>>
>>>> public class NormalModelDistribution implements
>>>> ModelDistribution<Vector>
>>>> {
>>>> @Override public Model<Vector>[] sampleFromPrior(int howMany) {
>>>> Model<Vector>[] result = new NormalModel[howMany]; for (int i = 0; i <
>>>> howMany; i++) { result[i] = new NormalModel(new DenseVector(2), 1); }
>>>> return
>>>> result; }
>>>>
>>>> and later this vector is dotted to
>>>>  @Override
>>>>  public double pdf(Vector x) {
>>>>   double sd2 = stdDev * stdDev;
>>>>   double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 *
>>>> sd2);
>>>>   double ex = Math.exp(exp);
>>>>   return ex / (stdDev * sqrt2pi);
>>>>  }
>>>>
>>>> x vector which is coming from Hadoop MapRunner through the map function:
>>>>
>>>>  public void map(WritableComparable<?> key, Vector v,
>>>>                 OutputCollector<Text, Vector> output, Reporter reporter)
>>>> throws IOException {
>>>>
>>>>
>>>> any idea?
>>>>
>>>> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? is it
>>>> safe
>>>> enough to run against trunk?
>>>>
>>>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <te...@gmail.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <
>>>>> bogdan.vatkov@gmail.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> wrote:
>>>>>>     Sorry, what does that mean :)?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> It means that there is probably a programming bug somehow.  At the very
>>>>> least, the program is not robust with respect to strange invocations.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> what is a dotted vector? and why aren't they the same?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> dot product is a vector operation that is the sum of products of
>>>>> corresponding elements of the two vectors being operated on.  If these
>>>>> vectors don't have the same length, then it is an error.
>>>>>
>>>>> what should I investigate?
>>>>>   I am not familiar with the code, but if I had time to look, my
>>>>> strategy
>>>>> would be to start in the NormalModel and work back up the stack trace
>>>>> to
>>>>> find out how the vectors came to be different lengths.  No doubt, the
>>>>> code
>>>>> in NormalModel will not tell you anything, but you can see which
>>>>> vectors
>>>>> are
>>>>> involved and by walking up the stack you may be able to see where they
>>>>> come
>>>>> from.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I am basically running my complete kmeans scenario (same input data,
>>>>>> same
>>>>>> number of clusters param, etc.) but just replacing KmeansDriver.main
>>>>>> step
>>>>>> with a DirichletDriver.main call...of course the arguments are
>>>>>> adjusted
>>>>>> since kmeans and dirichlet do not have the same arguments.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> I would think that this sounds very plausible.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I am not sure what number I should give for the alpha argument,
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> Alpha should have a value in the range from 0.01 to 20.  I would scan
>>>>> with
>>>>> 1,2, 5 magnitude steps to see what works well for your data.  (i.e.
>>>>> 0.01,
>>>>> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.
>>>>>  The
>>>>> effect of different values should be small over a pretty wide range.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> iterations
>>>>>> and reductions...here is my current argument set:
>>>>>>
>>>>>> args = new String[] {
>>>>>> "--input",
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> "--output", config.getClustersDir(),
>>>>>> "--modelClass",
>>>>>>
>>>>>> "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>>> "--maxIter", "15",
>>>>>> "--alpha", "1.0",
>>>>>> "--k", config.getClustersCount(),
>>>>>> "--maxRed", "2"
>>>>>> };
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> Not off-hand.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>
>


-- 
Best regards,
Bogdan

Re: CardinalityException in DirichletDriver

Posted by Liang Chenmin <li...@gmail.com>.
I had similar bugs before and found out that they were due to some changes in my
code which generated two vectors with different lengths. You could print out
some logs and look at the generated data to check.
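
For example, a quick-and-dirty check (just a sketch; the println is mine, and
whether the accessor is size() or cardinality() depends on the Vector API, so
treat that as an assumption) could go at the top of DirichletMapper.map:

  public void map(WritableComparable<?> key, Vector v,
                  OutputCollector<Text, Vector> output, Reporter reporter)
      throws IOException {
    // debug only: print the cardinality of every incoming vector so a
    // mismatch against the model mean (hard-coded to size 2) is easy to spot
    System.out.println("input vector cardinality = " + v.size());
    // ... existing mapping logic continues unchanged ...
  }

Comparing those numbers with the size of the vectors in your generated models
should tell you quickly which side is wrong.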

On Wed, Jan 13, 2010 at 3:49 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> I think KMeans and Canopy are the most-used and therefore the most robust.
> Dirichlet still has not seen much use beyond some test examples and
> NormalModel has at least one known problem (with sample() only returning the
> maximum likelihood) that has been reported but never fixed. Can you point me
> to the problem you are running so I can try to get up to speed? It has been
> some time since I worked in this code but I'm keen to do so and I have some
> time to invest.
>
> Jeff
>
>
>
> Bogdan Vatkov wrote:
>
>> But I am the first one to use Dirichlet which algorithm is the recommended
>> one? Are all other algs better then Dirichlet so no one used it ;)?
>>
>> On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <jdog@windwardsolutions.com
>> >wrote:
>>
>>
>>
>>> The NormalModelDistribution seems to still think all the data vectors are
>>> size=2.  In SampleFromPrior, it is creating models with that size.
>>> Subsequently, when you calculate the pdf with your data value (x) the
>>> sizes
>>> are incompatible. Suggest changing 'DenseVector(2)' to 'DenseVector(n)',
>>> where n is your data cardinality. Please also look at the rest of the
>>> math
>>> in DenseVector with suspiscion. AFAIK, you are the first person to try to
>>> use Dirichlet.
>>>
>>>
>>>
>>> Bogdan Vatkov wrote:
>>>
>>>
>>>
>>>> I see a stack  when the size of the vectore mean is set to 2:
>>>>
>>>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in
>>>> NormalModel))
>>>> NormalModel.<init>(Vector, double) line: 48
>>>> NormalModelDistribution.sampleFromPrior(int) line: 33
>>>> DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int)
>>>> line:
>>>> 48
>>>> DirichletDriver.createState(String, int, double) line: 172
>>>> DirichletDriver.writeInitialState(String, String, String, int, double)
>>>> line:
>>>> 150
>>>> DirichletDriver.runJob(String, String, String, int, int, double, int)
>>>> line:
>>>> 133
>>>> DirichletDriver.main(String[]) line: 109
>>>> Clusters.doClustering() line: 244
>>>> Clusters.access$0(Clusters) line: 175
>>>> Clusters$1.run() line: 148
>>>> Thread.run() line: 619
>>>>
>>>>
>>>> public class NormalModelDistribution implements
>>>> ModelDistribution<Vector>
>>>> {
>>>> @Override public Model<Vector>[] sampleFromPrior(int howMany) {
>>>> Model<Vector>[] result = new NormalModel[howMany]; for (int i = 0; i <
>>>> howMany; i++) { result[i] = new NormalModel(new DenseVector(2), 1); }
>>>> return
>>>> result; }
>>>>
>>>> and later this vector is dotted to
>>>>  @Override
>>>>  public double pdf(Vector x) {
>>>>   double sd2 = stdDev * stdDev;
>>>>   double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 *
>>>> sd2);
>>>>   double ex = Math.exp(exp);
>>>>   return ex / (stdDev * sqrt2pi);
>>>>  }
>>>>
>>>> x vector which is coming from Hadoop MapRunner through the map function:
>>>>
>>>>  public void map(WritableComparable<?> key, Vector v,
>>>>                 OutputCollector<Text, Vector> output, Reporter reporter)
>>>> throws IOException {
>>>>
>>>>
>>>> any idea?
>>>>
>>>> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? is it
>>>> safe
>>>> enough to run against trunk?
>>>>
>>>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <te...@gmail.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <
>>>>> bogdan.vatkov@gmail.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> wrote:
>>>>>>     Sorry, what does that mean :)?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> It means that there is probably a programming bug somehow.  At the very
>>>>> least, the program is not robust with respect to strange invocations.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> what is a dotted vector? and why aren't they the same?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> dot product is a vector operation that is the sum of products of
>>>>> corresponding elements of the two vectors being operated on.  If these
>>>>> vectors don't have the same length, then it is an error.
>>>>>
>>>>> what should I investigate?
>>>>>   I am not familiar with the code, but if I had time to look, my
>>>>> strategy
>>>>> would be to start in the NormalModel and work back up the stack trace
>>>>> to
>>>>> find out how the vectors came to be different lengths.  No doubt, the
>>>>> code
>>>>> in NormalModel will not tell you anything, but you can see which
>>>>> vectors
>>>>> are
>>>>> involved and by walking up the stack you may be able to see where they
>>>>> come
>>>>> from.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I am basically running my complete kmeans scenario (same input data,
>>>>>> same
>>>>>> number of clusters param, etc.) but just replacing KmeansDriver.main
>>>>>> step
>>>>>> with a DirichletDriver.main call...of course the arguments are
>>>>>> adjusted
>>>>>> since kmeans and dirichlet do not have the same arguments.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> I would think that this sounds very plausible.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I am not sure what number I should give for the alpha argument,
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> Alpha should have a value in the range from 0.01 to 20.  I would scan
>>>>> with
>>>>> 1,2, 5 magnitude steps to see what works well for your data.  (i.e.
>>>>> 0.01,
>>>>> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.
>>>>>  The
>>>>> effect of different values should be small over a pretty wide range.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> iterations
>>>>>> and reductions...here is my current argument set:
>>>>>>
>>>>>> args = new String[] {
>>>>>> "--input",
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> "--output", config.getClustersDir(),
>>>>>> "--modelClass",
>>>>>>
>>>>>> "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>>> "--maxIter", "15",
>>>>>> "--alpha", "1.0",
>>>>>> "--k", config.getClustersCount(),
>>>>>> "--maxRed", "2"
>>>>>> };
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> Not off-hand.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>
>


-- 
Chenmin Liang
Language Technologies Institute, School of Computer Science
Carnegie Mellon University

Re: CardinalityException in DirichletDriver

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I think KMeans and Canopy are the most-used and therefore the most 
robust. Dirichlet still has not seen much use beyond some test examples 
and NormalModel has at least one known problem (with sample() only 
returning the maximum likelihood) that has been reported but never 
fixed. Can you point me to the problem you are running so I can try to 
get up to speed? It has been some time since I worked in this code but 
I'm keen to do so and I have some time to invest.

Jeff


Bogdan Vatkov wrote:
> But I am the first one to use Dirichlet which algorithm is the recommended
> one? Are all other algs better then Dirichlet so no one used it ;)?
>
> On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <jd...@windwardsolutions.com>wrote:
>
>   
>> The NormalModelDistribution seems to still think all the data vectors are
>> size=2.  In SampleFromPrior, it is creating models with that size.
>> Subsequently, when you calculate the pdf with your data value (x) the sizes
>> are incompatible. Suggest changing 'DenseVector(2)' to 'DenseVector(n)',
>> where n is your data cardinality. Please also look at the rest of the math
>> in DenseVector with suspiscion. AFAIK, you are the first person to try to
>> use Dirichlet.
>>
>>
>>
>> Bogdan Vatkov wrote:
>>
>>     
>>> I see a stack  when the size of the vectore mean is set to 2:
>>>
>>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in
>>> NormalModel))
>>> NormalModel.<init>(Vector, double) line: 48
>>> NormalModelDistribution.sampleFromPrior(int) line: 33
>>> DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int)
>>> line:
>>> 48
>>> DirichletDriver.createState(String, int, double) line: 172
>>> DirichletDriver.writeInitialState(String, String, String, int, double)
>>> line:
>>> 150
>>> DirichletDriver.runJob(String, String, String, int, int, double, int)
>>> line:
>>> 133
>>> DirichletDriver.main(String[]) line: 109
>>> Clusters.doClustering() line: 244
>>> Clusters.access$0(Clusters) line: 175
>>> Clusters$1.run() line: 148
>>> Thread.run() line: 619
>>>
>>>
>>> public class NormalModelDistribution implements ModelDistribution<Vector>
>>> {
>>> @Override public Model<Vector>[] sampleFromPrior(int howMany) {
>>> Model<Vector>[] result = new NormalModel[howMany]; for (int i = 0; i <
>>> howMany; i++) { result[i] = new NormalModel(new DenseVector(2), 1); }
>>> return
>>> result; }
>>>
>>> and later this vector is dotted to
>>>  @Override
>>>  public double pdf(Vector x) {
>>>    double sd2 = stdDev * stdDev;
>>>    double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 *
>>> sd2);
>>>    double ex = Math.exp(exp);
>>>    return ex / (stdDev * sqrt2pi);
>>>  }
>>>
>>> x vector which is coming from Hadoop MapRunner through the map function:
>>>
>>>  public void map(WritableComparable<?> key, Vector v,
>>>                  OutputCollector<Text, Vector> output, Reporter reporter)
>>> throws IOException {
>>>
>>>
>>> any idea?
>>>
>>> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? is it
>>> safe
>>> enough to run against trunk?
>>>
>>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>>
>>>
>>>
>>>       
>>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com
>>>>
>>>>
>>>>         
>>>>> wrote:
>>>>>      Sorry, what does that mean :)?
>>>>>
>>>>>
>>>>>
>>>>>           
>>>> It means that there is probably a programming bug somehow.  At the very
>>>> least, the program is not robust with respect to strange invocations.
>>>>
>>>>
>>>>
>>>>
>>>>         
>>>>> what is a dotted vector? and why aren't they the same?
>>>>>
>>>>>
>>>>>
>>>>>           
>>>> dot product is a vector operation that is the sum of products of
>>>> corresponding elements of the two vectors being operated on.  If these
>>>> vectors don't have the same length, then it is an error.
>>>>
>>>> what should I investigate?
>>>>    I am not familiar with the code, but if I had time to look, my
>>>> strategy
>>>> would be to start in the NormalModel and work back up the stack trace to
>>>> find out how the vectors came to be different lengths.  No doubt, the
>>>> code
>>>> in NormalModel will not tell you anything, but you can see which vectors
>>>> are
>>>> involved and by walking up the stack you may be able to see where they
>>>> come
>>>> from.
>>>>
>>>>
>>>>
>>>>
>>>>         
>>>>> I am basically running my complete kmeans scenario (same input data,
>>>>> same
>>>>> number of clusters param, etc.) but just replacing KmeansDriver.main
>>>>> step
>>>>> with a DirichletDriver.main call...of course the arguments are adjusted
>>>>> since kmeans and dirichlet do not have the same arguments.
>>>>>
>>>>>
>>>>>
>>>>>           
>>>> I would think that this sounds very plausible.
>>>>
>>>>
>>>>
>>>>
>>>>         
>>>>> I am not sure what number I should give for the alpha argument,
>>>>>
>>>>>
>>>>>           
>>>> Alpha should have a value in the range from 0.01 to 20.  I would scan
>>>> with
>>>> 1,2, 5 magnitude steps to see what works well for your data.  (i.e. 0.01,
>>>> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.
>>>>  The
>>>> effect of different values should be small over a pretty wide range.
>>>>
>>>>
>>>>
>>>>
>>>>         
>>>>> iterations
>>>>> and reductions...here is my current argument set:
>>>>>
>>>>> args = new String[] {
>>>>> "--input",
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>           
>>>> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>
>>>>
>>>>         
>>>>> "--output", config.getClustersDir(),
>>>>> "--modelClass",
>>>>> "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>> "--maxIter", "15",
>>>>> "--alpha", "1.0",
>>>>> "--k", config.getClustersCount(),
>>>>> "--maxRed", "2"
>>>>> };
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>           
>>>> Not off-hand.
>>>>
>>>>
>>>>
>>>>         
>>>
>>>
>>>
>>>       
>>     
>
>
>   


Re: CardinalityException in DirichletDriver

Posted by Bogdan Vatkov <bo...@gmail.com>.
But if I am the first one to use Dirichlet, which algorithm is the recommended
one? Are all the other algorithms better than Dirichlet, so that no one has used it ;)?

On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> The NormalModelDistribution seems to still think all the data vectors are
> size=2.  In SampleFromPrior, it is creating models with that size.
> Subsequently, when you calculate the pdf with your data value (x) the sizes
> are incompatible. Suggest changing 'DenseVector(2)' to 'DenseVector(n)',
> where n is your data cardinality. Please also look at the rest of the math
> in DenseVector with suspiscion. AFAIK, you are the first person to try to
> use Dirichlet.
>
>
>
> Bogdan Vatkov wrote:
>
>> I see a stack  when the size of the vectore mean is set to 2:
>>
>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in
>> NormalModel))
>> NormalModel.<init>(Vector, double) line: 48
>> NormalModelDistribution.sampleFromPrior(int) line: 33
>> DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int)
>> line:
>> 48
>> DirichletDriver.createState(String, int, double) line: 172
>> DirichletDriver.writeInitialState(String, String, String, int, double)
>> line:
>> 150
>> DirichletDriver.runJob(String, String, String, int, int, double, int)
>> line:
>> 133
>> DirichletDriver.main(String[]) line: 109
>> Clusters.doClustering() line: 244
>> Clusters.access$0(Clusters) line: 175
>> Clusters$1.run() line: 148
>> Thread.run() line: 619
>>
>>
>> public class NormalModelDistribution implements ModelDistribution<Vector>
>> {
>> @Override public Model<Vector>[] sampleFromPrior(int howMany) {
>> Model<Vector>[] result = new NormalModel[howMany]; for (int i = 0; i <
>> howMany; i++) { result[i] = new NormalModel(new DenseVector(2), 1); }
>> return
>> result; }
>>
>> and later this vector is dotted to
>>  @Override
>>  public double pdf(Vector x) {
>>    double sd2 = stdDev * stdDev;
>>    double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 *
>> sd2);
>>    double ex = Math.exp(exp);
>>    return ex / (stdDev * sqrt2pi);
>>  }
>>
>> x vector which is coming from Hadoop MapRunner through the map function:
>>
>>  public void map(WritableComparable<?> key, Vector v,
>>                  OutputCollector<Text, Vector> output, Reporter reporter)
>> throws IOException {
>>
>>
>> any idea?
>>
>> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? is it
>> safe
>> enough to run against trunk?
>>
>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>>
>>
>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com
>>>
>>>
>>>> wrote:
>>>>      Sorry, what does that mean :)?
>>>>
>>>>
>>>>
>>> It means that there is probably a programming bug somehow.  At the very
>>> least, the program is not robust with respect to strange invocations.
>>>
>>>
>>>
>>>
>>>> what is a dotted vector? and why aren't they the same?
>>>>
>>>>
>>>>
>>> dot product is a vector operation that is the sum of products of
>>> corresponding elements of the two vectors being operated on.  If these
>>> vectors don't have the same length, then it is an error.
>>>
>>> what should I investigate?
>>>    I am not familiar with the code, but if I had time to look, my
>>> strategy
>>> would be to start in the NormalModel and work back up the stack trace to
>>> find out how the vectors came to be different lengths.  No doubt, the
>>> code
>>> in NormalModel will not tell you anything, but you can see which vectors
>>> are
>>> involved and by walking up the stack you may be able to see where they
>>> come
>>> from.
>>>
>>>
>>>
>>>
>>>> I am basically running my complete kmeans scenario (same input data,
>>>> same
>>>> number of clusters param, etc.) but just replacing KmeansDriver.main
>>>> step
>>>> with a DirichletDriver.main call...of course the arguments are adjusted
>>>> since kmeans and dirichlet do not have the same arguments.
>>>>
>>>>
>>>>
>>> I would think that this sounds very plausible.
>>>
>>>
>>>
>>>
>>>> I am not sure what number I should give for the alpha argument,
>>>>
>>>>
>>> Alpha should have a value in the range from 0.01 to 20.  I would scan
>>> with
>>> 1,2, 5 magnitude steps to see what works well for your data.  (i.e. 0.01,
>>> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.
>>>  The
>>> effect of different values should be small over a pretty wide range.
>>>
>>>
>>>
>>>
>>>> iterations
>>>> and reductions...here is my current argument set:
>>>>
>>>> args = new String[] {
>>>> "--input",
>>>>
>>>>
>>>>
>>>>
>>>
>>> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>
>>>
>>>> "--output", config.getClustersDir(),
>>>> "--modelClass",
>>>> "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>> "--maxIter", "15",
>>>> "--alpha", "1.0",
>>>> "--k", config.getClustersCount(),
>>>> "--maxRed", "2"
>>>> };
>>>>
>>>>
>>>>
>>>>
>>> Not off-hand.
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>


-- 
Best regards,
Bogdan

Re: CardinalityException in DirichletDriver

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
The NormalModelDistribution still seems to think all the data vectors 
are size=2. In sampleFromPrior, it is creating models with that size. 
Subsequently, when you calculate the pdf with your data value (x), the 
sizes are incompatible. I suggest changing 'DenseVector(2)' to 
'DenseVector(n)', where n is your data cardinality. Please also look at 
the rest of the math in DenseVector with suspicion. AFAIK, you are the 
first person to try to use Dirichlet.
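
Just to illustrate the idea (this is only a sketch, not the actual code: the
extra constructor and the 'cardinality' field are my own invention, and like
your excerpt it omits the imports), sampleFromPrior would end up looking
something like:

public class NormalModelDistribution implements ModelDistribution<Vector> {

  // hypothetical: the cardinality of the input data, supplied by the caller
  // instead of being hard-coded to 2
  private final int cardinality;

  public NormalModelDistribution(int cardinality) {
    this.cardinality = cardinality;
  }

  @Override
  public Model<Vector>[] sampleFromPrior(int howMany) {
    Model<Vector>[] result = new NormalModel[howMany];
    for (int i = 0; i < howMany; i++) {
      // the mean now has the same cardinality as the data, so
      // NormalModel.pdf(x) can dot it against x without a CardinalityException
      result[i] = new NormalModel(new DenseVector(cardinality), 1);
    }
    return result;
  }
}

The pdf exponent -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) is just the
squared distance between x and the mean, so every one of those dot products
requires x and mean to have the same size.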


Bogdan Vatkov wrote:
> I see a stack trace when the size of the vector mean is set to 2:
>
> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
> NormalModel.<init>(Vector, double) line: 48
> NormalModelDistribution.sampleFromPrior(int) line: 33
> DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line:
> 48
> DirichletDriver.createState(String, int, double) line: 172
> DirichletDriver.writeInitialState(String, String, String, int, double) line:
> 150
> DirichletDriver.runJob(String, String, String, int, int, double, int) line:
> 133
> DirichletDriver.main(String[]) line: 109
> Clusters.doClustering() line: 244
> Clusters.access$0(Clusters) line: 175
> Clusters$1.run() line: 148
> Thread.run() line: 619
>
>
> public class NormalModelDistribution implements ModelDistribution<Vector> {
> @Override public Model<Vector>[] sampleFromPrior(int howMany) {
> Model<Vector>[] result = new NormalModel[howMany]; for (int i = 0; i <
> howMany; i++) { result[i] = new NormalModel(new DenseVector(2), 1); } return
> result; }
>
> and later this vector is dotted to
>   @Override
>   public double pdf(Vector x) {
>     double sd2 = stdDev * stdDev;
>     double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
>     double ex = Math.exp(exp);
>     return ex / (stdDev * sqrt2pi);
>   }
>
> x vector which is coming from Hadoop MapRunner through the map function:
>
>   public void map(WritableComparable<?> key, Vector v,
>                   OutputCollector<Text, Vector> output, Reporter reporter)
> throws IOException {
>
>
> any idea?
>
> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? is it safe
> enough to run against trunk?
>
> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <te...@gmail.com> wrote:
>
>   
>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com
>>     
>>> wrote:
>>>       
>>> Sorry, what does that mean :)?
>>>
>>>       
>> It means that there is probably a programming bug somehow.  At the very
>> least, the program is not robust with respect to strange invocations.
>>
>>
>>     
>>> what is a dotted vector? and why aren't they the same?
>>>
>>>       
>> dot product is a vector operation that is the sum of products of
>> corresponding elements of the two vectors being operated on.  If these
>> vectors don't have the same length, then it is an error.
>>
>> what should I investigate?
>>     
>> I am not familiar with the code, but if I had time to look, my strategy
>> would be to start in the NormalModel and work back up the stack trace to
>> find out how the vectors came to be different lengths.  No doubt, the code
>> in NormalModel will not tell you anything, but you can see which vectors
>> are
>> involved and by walking up the stack you may be able to see where they come
>> from.
>>
>>
>>     
>>> I am basically running my complete kmeans scenario (same input data, same
>>> number of clusters param, etc.) but just replacing KmeansDriver.main step
>>> with a DirichletDriver.main call...of course the arguments are adjusted
>>> since kmeans and dirichlet do not have the same arguments.
>>>
>>>       
>> I would think that this sounds very plausible.
>>
>>
>>     
>>> I am not sure what number I should give for the alpha argument,
>>>       
>> Alpha should have a value in the range from 0.01 to 20.  I would scan with
>> 1,2, 5 magnitude steps to see what works well for your data.  (i.e. 0.01,
>> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.  The
>> effect of different values should be small over a pretty wide range.
>>
>>
>>     
>>> iterations
>>> and reductions...here is my current argument set:
>>>
>>> args = new String[] {
>>> "--input",
>>>
>>>
>>>       
>> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>     
>>> "--output", config.getClustersDir(),
>>> "--modelClass",
>>> "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>> "--maxIter", "15",
>>> "--alpha", "1.0",
>>> "--k", config.getClustersCount(),
>>> "--maxRed", "2"
>>> };
>>>
>>>
>>>       
>> Not off-hand.
>>
>>     
>
>
>
>   


Re: CardinalityException in DirichletDriver

Posted by Bogdan Vatkov <bo...@gmail.com>.
ok, just reproduced w/ code from trunk :|

On Wed, Jan 13, 2010 at 11:07 PM, Bogdan Vatkov <bo...@gmail.com>wrote:

> I see a stack trace when the size of the vector mean is set to 2:
>
> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
>  NormalModel.<init>(Vector, double) line: 48
> NormalModelDistribution.sampleFromPrior(int) line: 33
>  DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int)
> line: 48
>  DirichletDriver.createState(String, int, double) line: 172
> DirichletDriver.writeInitialState(String, String, String, int, double)
> line: 150
>  DirichletDriver.runJob(String, String, String, int, int, double, int)
> line: 133
> DirichletDriver.main(String[]) line: 109
>  Clusters.doClustering() line: 244
> Clusters.access$0(Clusters) line: 175
>  Clusters$1.run() line: 148
> Thread.run() line: 619
>
>
> public class NormalModelDistribution implements ModelDistribution<Vector> {
> @Override public Model<Vector>[] sampleFromPrior(int howMany) {
> Model<Vector>[] result = new NormalModel[howMany]; for (int i = 0; i <
> howMany; i++) { result[i] = new NormalModel(new DenseVector(2), 1); } return
> result; }
>
> and later this vector is dotted to
>   @Override
>   public double pdf(Vector x) {
>     double sd2 = stdDev * stdDev;
>     double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 *
> sd2);
>     double ex = Math.exp(exp);
>     return ex / (stdDev * sqrt2pi);
>   }
>
> x vector which is coming from Hadoop MapRunner through the map function:
>
>   public void map(WritableComparable<?> key, Vector v,
>                   OutputCollector<Text, Vector> output, Reporter reporter)
> throws IOException {
>
>
> any idea?
>
> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? is it safe
> enough to run against trunk?
>
> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <te...@gmail.com>wrote:
>
>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com
>> >wrote:
>>
>> > Sorry, what does that mean :)?
>> >
>>
>> It means that there is probably a programming bug somehow.  At the very
>> least, the program is not robust with respect to strange invocations.
>>
>>
>> > what is a dotted vector? and why aren't they the same?
>> >
>>
>> dot product is a vector operation that is the sum of products of
>> corresponding elements of the two vectors being operated on.  If these
>> vectors don't have the same length, then it is an error.
>>
>> what should I investigate?
>> >
>>
>> I am not familiar with the code, but if I had time to look, my strategy
>> would be to start in the NormalModel and work back up the stack trace to
>> find out how the vectors came to be different lengths.  No doubt, the code
>> in NormalModel will not tell you anything, but you can see which vectors
>> are
>> involved and by walking up the stack you may be able to see where they
>> come
>> from.
>>
>>
>> > I am basically running my complete kmeans scenario (same input data,
>> same
>> > number of clusters param, etc.) but just replacing KmeansDriver.main
>> step
>> > with a DirichletDriver.main call...of course the arguments are adjusted
>> > since kmeans and dirichlet do not have the same arguments.
>> >
>>
>> I would think that this sounds very plausible.
>>
>>
>> > I am not sure what number I should give for the alpha argument,
>>
>>
>> Alpha should have a value in the range from 0.01 to 20.  I would scan with
>> 1,2, 5 magnitude steps to see what works well for your data.  (i.e. 0.01,
>> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.  The
>> effect of different values should be small over a pretty wide range.
>>
>>
>> > iterations
>> > and reductions...here is my current argument set:
>> >
>> > args = new String[] {
>> > "--input",
>> >
>> >
>> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>> > "--output", config.getClustersDir(),
>> > "--modelClass",
>> > "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>> > "--maxIter", "15",
>> > "--alpha", "1.0",
>> > "--k", config.getClustersCount(),
>> > "--maxRed", "2"
>> > };
>> >
>> >
>> Not off-hand.
>>
>
>
>
> --
> Best regards,
> Bogdan
>
>


-- 
Best regards,
Bogdan

Re: CardinalityException in DirichletDriver

Posted by Bogdan Vatkov <bo...@gmail.com>.
I see a stack trace when the size of the vector mean is set to 2:

Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
NormalModel.<init>(Vector, double) line: 48
NormalModelDistribution.sampleFromPrior(int) line: 33
DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line:
48
DirichletDriver.createState(String, int, double) line: 172
DirichletDriver.writeInitialState(String, String, String, int, double) line:
150
DirichletDriver.runJob(String, String, String, int, int, double, int) line:
133
DirichletDriver.main(String[]) line: 109
Clusters.doClustering() line: 244
Clusters.access$0(Clusters) line: 175
Clusters$1.run() line: 148
Thread.run() line: 619


public class NormalModelDistribution implements ModelDistribution<Vector> {
  @Override
  public Model<Vector>[] sampleFromPrior(int howMany) {
    Model<Vector>[] result = new NormalModel[howMany];
    for (int i = 0; i < howMany; i++) {
      result[i] = new NormalModel(new DenseVector(2), 1);
    }
    return result;
  }

and later this vector is dotted with the x vector in pdf():
  @Override
  public double pdf(Vector x) {
    double sd2 = stdDev * stdDev;
    double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
    double ex = Math.exp(exp);
    return ex / (stdDev * sqrt2pi);
  }

The x vector comes from the Hadoop MapRunner through the map function:

  public void map(WritableComparable<?> key, Vector v,
                  OutputCollector<Text, Vector> output, Reporter reporter)
throws IOException {


any idea?

btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? is it safe
enough to run against trunk?

On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <te...@gmail.com> wrote:

> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com
> >wrote:
>
> > Sorry, what does that mean :)?
> >
>
> It means that there is probably a programming bug somehow.  At the very
> least, the program is not robust with respect to strange invocations.
>
>
> > what is a dotted vector? and why aren't they the same?
> >
>
> dot product is a vector operation that is the sum of products of
> corresponding elements of the two vectors being operated on.  If these
> vectors don't have the same length, then it is an error.
>
> what should I investigate?
> >
>
> I am not familiar with the code, but if I had time to look, my strategy
> would be to start in the NormalModel and work back up the stack trace to
> find out how the vectors came to be different lengths.  No doubt, the code
> in NormalModel will not tell you anything, but you can see which vectors
> are
> involved and by walking up the stack you may be able to see where they come
> from.
>
>
> > I am basically running my complete kmeans scenario (same input data, same
> > number of clusters param, etc.) but just replacing KmeansDriver.main step
> > with a DirichletDriver.main call...of course the arguments are adjusted
> > since kmeans and dirichlet do not have the same arguments.
> >
>
> I would think that this sounds very plausible.
>
>
> > I am not sure what number I should give for the alpha argument,
>
>
> Alpha should have a value in the range from 0.01 to 20.  I would scan with
> 1,2, 5 magnitude steps to see what works well for your data.  (i.e. 0.01,
> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.  The
> effect of different values should be small over a pretty wide range.
>
>
> > iterations
> > and reductions...here is my current argument set:
> >
> > args = new String[] {
> > "--input",
> >
> >
> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
> > "--output", config.getClustersDir(),
> > "--modelClass",
> > "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
> > "--maxIter", "15",
> > "--alpha", "1.0",
> > "--k", config.getClustersCount(),
> > "--maxRed", "2"
> > };
> >
> >
> Not off-hand.
>



-- 
Best regards,
Bogdan

Re: CardinalityException in DirichletDriver

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <bo...@gmail.com>wrote:

> Sorry, what does that mean :)?
>

It means that there is probably a programming bug somehow.  At the very
least, the program is not robust with respect to strange invocations.


> what is a dotted vector? and why aren't they the same?
>

dot product is a vector operation that is the sum of products of
corresponding elements of the two vectors being operated on.  If these
vectors don't have the same length, then it is an error.
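
In plain Java, the concept is just this (a conceptual sketch, not Mahout's
actual implementation; the stack trace you posted shows AbstractVector.dot
doing the equivalent check and throwing CardinalityException):

  static double dot(double[] a, double[] b) {
    if (a.length != b.length) {
      // this is the situation that Mahout reports as a CardinalityException
      throw new IllegalArgumentException("vectors differ in length: "
          + a.length + " vs " + b.length);
    }
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];   // sum of products of corresponding elements
    }
    return sum;
  }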

> what should I investigate?
>

I am not familiar with the code, but if I had time to look, my strategy
would be to start in the NormalModel and work back up the stack trace to
find out how the vectors came to be different lengths.  No doubt, the code
in NormalModel will not tell you anything, but you can see which vectors are
involved and by walking up the stack you may be able to see where they come
from.


> I am basically running my complete kmeans scenario (same input data, same
> number of clusters param, etc.) but just replacing KmeansDriver.main step
> with a DirichletDriver.main call...of course the arguments are adjusted
> since kmeans and dirichlet do not have the same arguments.
>

I would think that this sounds very plausible.


> I am not sure what number I should give for the alpha argument,


Alpha should have a value in the range from 0.01 to 20.  I would scan with
1,2, 5 magnitude steps to see what works well for your data.  (i.e. 0.01,
0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.  The
effect of different values should be small over a pretty wide range.
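
If it helps, that 1-2-5 scan is trivial to generate; this throwaway loop (pure
illustration, nothing to do with Mahout itself) prints roughly the series
above, modulo floating-point rounding:

  // generate a 1-2-5 series of alpha values from 0.01 up to 20
  for (double decade = 0.01; decade <= 20; decade *= 10) {
    for (double step : new double[] {1, 2, 5}) {
      double alpha = decade * step;
      if (alpha > 20) {
        break;
      }
      System.out.println(alpha);   // 0.01, 0.02, 0.05, 0.1, ... 10, 20
    }
  }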


> iterations
> and reductions...here is my current argument set:
>
> args = new String[] {
> "--input",
>
> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
> "--output", config.getClustersDir(),
> "--modelClass",
> "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
> "--maxIter", "15",
> "--alpha", "1.0",
> "--k", config.getClustersCount(),
> "--maxRed", "2"
> };
>
>
Not off-hand.

Re: CardinalityException in DirichletDriver

Posted by Bogdan Vatkov <bo...@gmail.com>.
Sorry, what does that mean :)?
What is a dotted vector, and why aren't they the same?
What should I investigate?
I am basically running my complete kmeans scenario (same input data, same
number of clusters param, etc.) but just replacing KmeansDriver.main step
with a DirichletDriver.main call...of course the arguments are adjusted
since kmeans and dirichlet do not have the same arguments.
I am not sure what number I should give for the alpha argument, iterations
and reductions...here is my current argument set:

args = new String[] {
"--input",
"/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
"--output", config.getClustersDir(),
"--modelClass",
"org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
"--maxIter", "15",
"--alpha", "1.0",
"--k", config.getClustersCount(),
"--maxRed", "2"
};

anything suspicious in there?
On Wed, Jan 13, 2010 at 2:44 AM, Grant Ingersoll <gs...@apache.org>wrote:

> I don't have the code in front of me, but if I had to guess based on the
> location of the stack trace, I'm going to guess it is b/c the sizes of the
> two vectors being "dotted" aren't the same.
>
> On Jan 12, 2010, at 6:46 PM, Bogdan Vatkov wrote:
>
> > what could be the reason for this Cardinality exception?
> >
> > 10/01/13 01:41:09 INFO clustering.SolrToMahoutDriver: Wrote: 174 vectors
> > 10/01/13 01:41:09 INFO clustering.SolrToMahoutDriver: Dictionary Output
> > file:
> > /store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/dictionary.txt
> > 10/01/13 01:41:11 INFO dirichlet.DirichletDriver: Iteration 0
> > 10/01/13 01:41:11 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> > processName=JobTracker, sessionId=
> > 10/01/13 01:41:11 WARN mapred.JobClient: Use GenericOptionsParser for
> > parsing the arguments. Applications should implement Tool for the same.
> > 10/01/13 01:41:11 INFO mapred.FileInputFormat: Total input paths to
> process
> > : 1
> > 10/01/13 01:41:11 INFO mapred.JobClient: Running job: job_local_0001
> > 10/01/13 01:41:11 INFO mapred.FileInputFormat: Total input paths to
> process
> > : 1
> > 10/01/13 01:41:11 INFO compress.CodecPool: Got brand-new decompressor
> > 10/01/13 01:41:11 INFO mapred.MapTask: numReduceTasks: 1
> > 10/01/13 01:41:11 INFO mapred.MapTask: io.sort.mb = 100
> > 10/01/13 01:41:12 INFO mapred.MapTask: data buffer = 79691776/99614720
> > 10/01/13 01:41:12 INFO mapred.MapTask: record buffer = 262144/327680
> > 10/01/13 01:41:12 WARN mapred.LocalJobRunner: job_local_0001
> > org.apache.mahout.matrix.CardinalityException
> > at org.apache.mahout.matrix.AbstractVector.dot(AbstractVector.java:92)
> > at
> >
> org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:111)
> > at
> >
> org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:28)
> > at
> >
> org.apache.mahout.clustering.dirichlet.DirichletState.adjustedProbability(DirichletState.java:129)
> > at
> >
> org.apache.mahout.clustering.dirichlet.DirichletMapper.normalizedProbabilities(DirichletMapper.java:111)
> > at
> >
> org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:47)
> > at
> >
> org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:38)
> > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
> > 10/01/13 01:41:12 INFO mapred.JobClient:  map 0% reduce 0%
> > 10/01/13 01:41:12 INFO mapred.JobClient: Job complete: job_local_0001
> > 10/01/13 01:41:12 INFO mapred.JobClient: Counters: 0
> > 10/01/13 01:41:12 WARN dirichlet.DirichletDriver: java.io.IOException:
> Job
> > failed!
> > java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
> > at
> >
> org.apache.mahout.clustering.dirichlet.DirichletDriver.runIteration(DirichletDriver.java:214)
> > at
> >
> org.apache.mahout.clustering.dirichlet.DirichletDriver.runJob(DirichletDriver.java:139)
> > at
> >
> org.apache.mahout.clustering.dirichlet.DirichletDriver.main(DirichletDriver.java:109)
> > at org.bogdan.clustering.mbeans.Clusters.doClustering(Clusters.java:244)
> > at org.bogdan.clustering.mbeans.Clusters.access$0(Clusters.java:175)
> > at org.bogdan.clustering.mbeans.Clusters$1.run(Clusters.java:148)
> > at java.lang.Thread.run(Thread.java:619)
>
>


-- 
Best regards,
Bogdan

Re: CardinalityException in DirichletDriver

Posted by Grant Ingersoll <gs...@apache.org>.
I don't have the code in front of me, but if I had to guess based on the location of the stack trace, I'm going to guess it is b/c the sizes of the two vectors being "dotted" aren't the same.

On Jan 12, 2010, at 6:46 PM, Bogdan Vatkov wrote:

> what could be the reason for this Cardinality exception?
> 
> 10/01/13 01:41:09 INFO clustering.SolrToMahoutDriver: Wrote: 174 vectors
> 10/01/13 01:41:09 INFO clustering.SolrToMahoutDriver: Dictionary Output
> file:
> /store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/dictionary.txt
> 10/01/13 01:41:11 INFO dirichlet.DirichletDriver: Iteration 0
> 10/01/13 01:41:11 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=JobTracker, sessionId=
> 10/01/13 01:41:11 WARN mapred.JobClient: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the same.
> 10/01/13 01:41:11 INFO mapred.FileInputFormat: Total input paths to process
> : 1
> 10/01/13 01:41:11 INFO mapred.JobClient: Running job: job_local_0001
> 10/01/13 01:41:11 INFO mapred.FileInputFormat: Total input paths to process
> : 1
> 10/01/13 01:41:11 INFO compress.CodecPool: Got brand-new decompressor
> 10/01/13 01:41:11 INFO mapred.MapTask: numReduceTasks: 1
> 10/01/13 01:41:11 INFO mapred.MapTask: io.sort.mb = 100
> 10/01/13 01:41:12 INFO mapred.MapTask: data buffer = 79691776/99614720
> 10/01/13 01:41:12 INFO mapred.MapTask: record buffer = 262144/327680
> 10/01/13 01:41:12 WARN mapred.LocalJobRunner: job_local_0001
> org.apache.mahout.matrix.CardinalityException
> at org.apache.mahout.matrix.AbstractVector.dot(AbstractVector.java:92)
> at
> org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:111)
> at
> org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:28)
> at
> org.apache.mahout.clustering.dirichlet.DirichletState.adjustedProbability(DirichletState.java:129)
> at
> org.apache.mahout.clustering.dirichlet.DirichletMapper.normalizedProbabilities(DirichletMapper.java:111)
> at
> org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:47)
> at
> org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:38)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
> 10/01/13 01:41:12 INFO mapred.JobClient:  map 0% reduce 0%
> 10/01/13 01:41:12 INFO mapred.JobClient: Job complete: job_local_0001
> 10/01/13 01:41:12 INFO mapred.JobClient: Counters: 0
> 10/01/13 01:41:12 WARN dirichlet.DirichletDriver: java.io.IOException: Job
> failed!
> java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
> at
> org.apache.mahout.clustering.dirichlet.DirichletDriver.runIteration(DirichletDriver.java:214)
> at
> org.apache.mahout.clustering.dirichlet.DirichletDriver.runJob(DirichletDriver.java:139)
> at
> org.apache.mahout.clustering.dirichlet.DirichletDriver.main(DirichletDriver.java:109)
> at org.bogdan.clustering.mbeans.Clusters.doClustering(Clusters.java:244)
> at org.bogdan.clustering.mbeans.Clusters.access$0(Clusters.java:175)
> at org.bogdan.clustering.mbeans.Clusters$1.run(Clusters.java:148)
> at java.lang.Thread.run(Thread.java:619)