Posted to user@mahout.apache.org by David Saile <da...@uni-koblenz.de> on 2011/05/26 19:35:40 UTC

CardinalityException during data clustering

Hi list,

As suggested in previous posts, I am trying to use k-means to assign newly arriving documents to existing clusters.

However, while trying to assign the vectors corresponding to the new documents to the existing clusters (using KMeansDriver.clusterData(…)), I am running into an org.apache.mahout.math.CardinalityException.
See below for the complete stack-trace. 

For vector creation I use Mahout's DictionaryVectorizer. 
I assume this exception occurs because the new vectors have a different cardinality than the previously computed clusters.

Is there some way to assign a fixed cardinality to all vectors? Or is there any other solution for this?

I would really appreciate any help! Thanks,
David

 

java.lang.Exception: org.apache.mahout.math.CardinalityException: Required cardinality 16 but got 22
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:371)
Caused by: org.apache.mahout.math.CardinalityException: Required cardinality 16 but got 22
	at org.apache.mahout.math.RandomAccessSparseVector.dot(RandomAccessSparseVector.java:172)
	at org.apache.mahout.math.NamedVector.dot(NamedVector.java:127)
	at org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure.distance(SquaredEuclideanDistanceMeasure.java:57)
	at org.apache.mahout.clustering.kmeans.KMeansClusterer.outputPointWithClusterInfo(KMeansClusterer.java:140)
	at org.apache.mahout.clustering.kmeans.KMeansClusterMapper.map(KMeansClusterMapper.java:40)
	at org.apache.mahout.clustering.kmeans.KMeansClusterMapper.map(KMeansClusterMapper.java:1)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:652)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:238)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:680)

Re: CardinalityException during data clustering

Posted by Ted Dunning <te...@gmail.com>.
The text value encoder has a special set of methods so that you can add text
that it tokenizes for you.  That is generally the easiest method.

You can tokenize it yourself and use the addToVector method if you like.
 Sometimes that is preferable because you may have a non-Lucene tokenizer or
you may want to avoid double tokenization (or a hundred other reasons).
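
To make that concrete, here is a rough, untested sketch of both routes. The package name org.apache.mahout.vectorizer.encoders, the cardinality of 10,000 and the feature name "body" are placeholders/assumptions on my part (the encoder package has moved between releases), so check them against the release you are on:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
import org.apache.mahout.vectorizer.encoders.TextValueEncoder;

public final class HashedEncoding {

  // Every vector gets the same fixed cardinality, so newly arriving
  // documents can never disagree with the existing cluster centers.
  private static final int CARDINALITY = 10000;

  // Route 1: hand the raw text to the encoder and let it tokenize for you.
  public static Vector encodeText(CharSequence text) {
    Vector v = new RandomAccessSparseVector(CARDINALITY);
    TextValueEncoder encoder = new TextValueEncoder("body");
    encoder.addToVector(text.toString(), 1.0, v);
    return v;
  }

  // Route 2: tokenize yourself (Lucene or otherwise) and add word by word.
  public static Vector encodeTokens(Iterable<String> tokens) {
    Vector v = new RandomAccessSparseVector(CARDINALITY);
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("body");
    for (String token : tokens) {
      encoder.addToVector(token, 1.0, v);
    }
    return v;
  }
}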

On Fri, May 27, 2011 at 8:49 AM, David Saile <da...@uni-koblenz.de> wrote:

> I really appreciate your help Ted!
>
> As I am new to Mahout, could you please point me in the right direction?
>
> From looking at the code I get the impression that I would need to use the
> TextValueEncoder class and repeatedly call
> addToVector(String originalForm, double weight, Vector data)
> for each word in a given document. Is this correct?
>
>
> On 27.05.2011 at 17:26, Ted Dunning wrote:
>
> > You have to write or adapt some code.  This is the big current down-side
> of
> > the hashing encoders.
> >
> > On Fri, May 27, 2011 at 2:38 AM, David Saile <da...@uni-koblenz.de>
> wrote:
> >
> >>> The other option is to use the hashing encoders.  They inherently
> produce
> >>> output of fixed cardinality.  The down-side with that is that the
> meaning
> >> of
> >>> lots of distance measures is hard to understand in the hashed
> frameworks.
> >>> Distances that are invariant under linear transformations work
> perfectly.
> >>> Some others like Manhattan distance work pretty well.  Others can be
> >>> totally confused.
> >>
> >> This sounds like an option that eliminates the need for a global
> dictionary
> >> (with regard to multiple vectorizer runs).
> >> How can I specify the use of hashing encoders for vectorization?
>
>

Re: CardinalityException during data clustering

Posted by David Saile <da...@uni-koblenz.de>.
I really appreciate your help Ted!

As I am new to Mahout, could you please point me in the right direction?

From looking at the code I get the impression that I would need to use the TextValueEncoder class and repeatedly call 
addToVector(String originalForm, double weight, Vector data)
for each word in a given document. Is this correct?

 
On 27.05.2011 at 17:26, Ted Dunning wrote:

> You have to write or adapt some code.  This is the big current down-side of
> the hashing encoders.
> 
> On Fri, May 27, 2011 at 2:38 AM, David Saile <da...@uni-koblenz.de> wrote:
> 
>>> The other option is to use the hashing encoders.  They inherently produce
>>> output of fixed cardinality.  The down-side with that is that the meaning
>> of
>>> lots of distance measures is hard to understand in the hashed frameworks.
>>> Distances that are invariant under linear transformations work perfectly.
>>> Some others like Manhattan distance work pretty well.  Others can be
>>> totally confused.
>> 
>> This sounds like an option that eliminates the need for a global dictionary
>> (with regard to multiple vectorizer runs)
>> How can I specify the use of hashing encoders for vectorization?


Re: CardinalityException during data clustering

Posted by Ted Dunning <te...@gmail.com>.
You have to write or adapt some code.  This is the big current down-side of
the hashing encoders.

On Fri, May 27, 2011 at 2:38 AM, David Saile <da...@uni-koblenz.de> wrote:

> > The other option is to use the hashing encoders.  They inherently produce
> > output of fixed cardinality.  The down-side with that is that the meaning
> of
> > lots of distance measures is hard to understand in the hashed frameworks.
> > Distances that are invariant under linear transformations work perfectly.
> > Some others like Manhattan distance work pretty well.  Others can be
> > totally confused.
>
> This sounds like an option that eliminates the need for a global dictionary
> (with regard to multiple vectorizer runs).
> How can I specify the use of hashing encoders for vectorization?

Re: CardinalityException during data clustering

Posted by David Saile <da...@uni-koblenz.de>.
On 26.05.2011 at 20:05, Ted Dunning wrote:

> On Thu, May 26, 2011 at 10:35 AM, David Saile <da...@uni-koblenz.de> wrote:
> 
> >> I assume this exception occurs because the new vectors have a different
>> cardinality than the previously computed clusters.
>> 
> 
> Correct
> 
> 
>> Is there some way to assign a fixed cardinality to all vectors? Or is there
>> any other solution for this?
>> 
> 
> I think that there is a way to use a fixed dictionary.  

I guess what you are referring to (and what I actually overlooked) is that I need to use the dictionaries from previous runs in order to ensure that words have consistent IDs.

Can someone point me to how I can pass an existing dictionary to the DictionaryVectorizer? 
In the mahout-0.4 release I am using, DictionaryVectorizer.createTermFrequencyVectors(…) does not take any dictionary-path argument.


> If we don't already have it, there should be a provision for adding an
> extra slot for unknown words to fit into.

I could not find this functionality, but I guess implementing this should not be too hard.    
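
To make my idea concrete, here is a rough, untested sketch of what I would try. I am assuming that the dictionary chunk written by my first run (e.g. dictionary.file-0) is a SequenceFile of Text -> IntWritable, and that a plain term-frequency weight is good enough for my case:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public final class FixedDictionaryEncoder {

  // Load the term -> index mapping written by the first vectorizer run.
  public static Map<String, Integer> loadDictionary(Path dictPath, Configuration conf)
      throws IOException {
    Map<String, Integer> dictionary = new HashMap<String, Integer>();
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), dictPath, conf);
    try {
      Text term = new Text();
      IntWritable index = new IntWritable();
      while (reader.next(term, index)) {
        dictionary.put(term.toString(), index.get());
      }
    } finally {
      reader.close();
    }
    return dictionary;
  }

  // Encode a new document against the old dictionary so its cardinality
  // matches the existing cluster centers. Unknown words are simply dropped
  // here; routing them to an extra slot would only keep cardinalities
  // consistent if the original clustering had been built with that slot too.
  public static Vector encode(Iterable<String> tokens, Map<String, Integer> dictionary) {
    Vector vector = new RandomAccessSparseVector(dictionary.size());
    for (String token : tokens) {
      Integer slot = dictionary.get(token);
      if (slot != null) {
        vector.setQuick(slot, vector.getQuick(slot) + 1.0); // raw term frequency
      }
    }
    return vector;
  }
}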

> 
> The other option is to use the hashing encoders.  They inherently produce
> output of fixed cardinality.  The down-side with that is that the meaning of
> lots of distance measures is hard to understand in the hashed frameworks.
> Distances that are invariant under linear transformations work perfectly.
> Some others like Manhattan distance work pretty well.  Others can be
> totally confused.

This sounds like an option that eliminates the need for a global dictionary (with regard to multiple vectorizer runs).
How can I specify the use of hashing encoders for vectorization?


Thanks for your help!

David

Re: CardinalityException during data clustering

Posted by Ted Dunning <te...@gmail.com>.
On Thu, May 26, 2011 at 10:35 AM, David Saile <da...@uni-koblenz.de> wrote:

> I assume this exception occurs because the new vectors have a different
> cardinality than the previously computed clusters.
>

Correct


> Is there some way to assign a fixed cardinality to all vectors? Or is there
> any other solution for this?
>

I think that there is a way to use a fixed dictionary.  If we don't already
have it, there should be a provision for adding an extra slot for unknown
words to fit into.

The other option is to use the hashing encoders.  They inherently produce
output of fixed cardinality.  The down-side with that is that the meaning of
lots of distance measures is hard to understand in the hashed frameworks.
 Distances that are invariant under linear transformations work perfectly.
 Some others like Manhattan distance work pretty well.  Others can be
totally confused.

RE: CardinalityException during data clustering

Posted by Jeff Eastman <je...@Narus.com>.
Yes, your new documents have introduced new terms, which have increased the size of the document vectors beyond the size of the cluster centers. If you convert your cluster centers to use sparse vectors with max_int size then you should be able to move forward. 
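
A minimal sketch of that conversion, assuming the Vector API of the 0.4/0.5 line (iterateNonZero()). Note that the incoming document vectors then need to be built with the same Integer.MAX_VALUE cardinality, because the dot() in your stack trace requires both cardinalities to match:

import java.util.Iterator;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public final class CenterConversion {

  // Copy a cluster center into a sparse vector of "unlimited" cardinality so
  // that wider document vectors no longer trip the CardinalityException.
  public static Vector toMaxIntSparse(Vector center) {
    Vector sparse = new RandomAccessSparseVector(Integer.MAX_VALUE);
    Iterator<Vector.Element> it = center.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      sparse.setQuick(e.index(), e.get());
    }
    return sparse;
  }
}

You would apply this to each center before handing the clusters to KMeansDriver.clusterData(), and create the new document vectors with the same max_int cardinality.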

-----Original Message-----
From: David Saile [mailto:david@uni-koblenz.de] 
Sent: Thursday, May 26, 2011 10:36 AM
To: user@mahout.apache.org
Subject: CardinalityException during data clustering 

Hi list,

As suggested in previous posts, I am trying to use k-means to assign newly arriving documents to existing clusters.

However, while trying to assign the vectors corresponding to the new documents to the existing clusters (using KMeansDriver.clusterData(...)), I am running into an org.apache.mahout.math.CardinalityException.
See below for the complete stack-trace. 

For vector creation I use Mahout's DictionaryVectorizer. 
I assume this exception occurs because the new vectors have a different cardinality than the previously computed clusters.

Is there some way to assign a fixed cardinality to all vectors? Or is there any other solution for this?

I would really appreciate any help! Thanks,
David

 

java.lang.Exception: org.apache.mahout.math.CardinalityException: Required cardinality 16 but got 22
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:371)
Caused by: org.apache.mahout.math.CardinalityException: Required cardinality 16 but got 22
	at org.apache.mahout.math.RandomAccessSparseVector.dot(RandomAccessSparseVector.java:172)
	at org.apache.mahout.math.NamedVector.dot(NamedVector.java:127)
	at org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure.distance(SquaredEuclideanDistanceMeasure.java:57)
	at org.apache.mahout.clustering.kmeans.KMeansClusterer.outputPointWithClusterInfo(KMeansClusterer.java:140)
	at org.apache.mahout.clustering.kmeans.KMeansClusterMapper.map(KMeansClusterMapper.java:40)
	at org.apache.mahout.clustering.kmeans.KMeansClusterMapper.map(KMeansClusterMapper.java:1)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:652)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:238)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:680)