Posted to user@mahout.apache.org by rahul raghavendhra <ra...@gmail.com> on 2012/01/11 10:39:51 UTC

Synthetic control dataset clustering { Doubt Reg Cardinality }

Hi all, I have run
org.apache.mahout.clustering.syntheticcontrol.<>.Job successfully. When I
run it with a similar dataset (double values separated by a single space),

I get the error org.apache.mahout.math.CardinalityException: Required
cardinality 16 but got 91

How is this cardinality calculated, and how is it passed to the k-means driver?
How do I calculate the cardinality for any dataset?

Please help.


./rahul

Re: Synthetic control dataset clustering { Doubt Reg Cardinality }

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

It kinda looks to me like you have some inconsistent data in your input 
data set. Here's the sequence of operations that occur in Synthetic Control:

1. The InputDriver reads your input text files and produces sequence 
files of VectorWritable. Each vector is produced by the InputMapper 
after processing a single line of your input file and the size of the 
vector is determined on a per-line basis. If your lines do not all have 
the same number of entries then these vectors will be of differing size.
2. The RandomSeedGenerator is then run to sample from your input vectors 
to produce the clusters-0 directory which contains the starting 
conditions for K-Means. It does not care about the sizes of the vectors. 
If your sampled input vectors are not all the same size, then your initial 
clusters will have differently sized center vectors.
3. Each K-Means iteration processes all points against all initial 
clusters. If your input vectors and your cluster center vectors are not 
the same size, then the distance measure calculation will throw the 
CardinalityException and the job will fail.
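
One way to check for the inconsistency described in step 1 before running the job is to count the values on each input line. The following is only a minimal sketch: LineWidthCheck and widthHistogram are hypothetical names, not part of Mahout, and it assumes values are separated by whitespace and that blank lines produce no vector.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Mahout: counts how many values each input line holds. */
class LineWidthCheck {

  /** Maps "values per line" to "number of lines with that many values". */
  static Map<Integer, Integer> widthHistogram(List<String> lines) {
    Map<Integer, Integer> histogram = new TreeMap<>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue; // assumption: a blank line yields no vector
      }
      int width = trimmed.split("\\s+").length;
      histogram.merge(width, 1, Integer::sum);
    }
    return histogram;
  }

  public static void main(String[] args) {
    // Made-up sample: two lines with 3 values and one line with 4.
    List<String> sample = Arrays.asList("1.0 2.0 3.0", "4.0 5.0 6.0", "7.0 8.0 9.0 10.0");
    Map<Integer, Integer> histogram = widthHistogram(sample);
    System.out.println(histogram);
    if (histogram.size() > 1) {
      System.out.println("Inconsistent line widths: expect vectors of differing size");
    }
  }
}
```

For a real file, the lines could come from java.nio.file.Files.readAllLines. More than one key in the result means the lines do not all have the same number of entries.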

The stack trace you sent was for the job, not for the mapper that threw 
the exception, so it does not shed much light on the particulars of your 
problem. You could run this with -xm sequential in a debugger and 
easily find the source of your problem.

Jeff



Re: Synthetic control dataset clustering { Doubt Reg Cardinality }

Posted by rahul raghavendhra <ra...@gmail.com>.

When I tried with the libra dataset, I got the following error:

MAHOUT-JOB:
/usr/local/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
12/01/11 16:27:28 WARN driver.MahoutDriver: No
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.props found on
classpath, will use command-line arguments only
12/01/11 16:27:28 INFO kmeans.Job: Running with default arguments
12/01/11 16:27:28 INFO common.HadoopUtil: Deleting output
12/01/11 16:27:28 INFO kmeans.Job: Preparing Input
12/01/11 16:27:29 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
12/01/11 16:27:30 INFO input.FileInputFormat: Total input paths to process
: 1
12/01/11 16:27:30 INFO mapred.JobClient: Running job: job_201201111447_0013
12/01/11 16:27:31 INFO mapred.JobClient:  map 0% reduce 0%
12/01/11 16:27:47 INFO mapred.JobClient:  map 100% reduce 0%
12/01/11 16:27:49 INFO mapred.JobClient: Job complete: job_201201111447_0013
12/01/11 16:27:49 INFO mapred.JobClient: Counters: 7
12/01/11 16:27:49 INFO mapred.JobClient:   Job Counters
12/01/11 16:27:49 INFO mapred.JobClient:     Launched map tasks=1
12/01/11 16:27:49 INFO mapred.JobClient:     Data-local map tasks=1
12/01/11 16:27:49 INFO mapred.JobClient:   FileSystemCounters
12/01/11 16:27:49 INFO mapred.JobClient:     HDFS_BYTES_READ=32011
12/01/11 16:27:49 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=37855
12/01/11 16:27:49 INFO mapred.JobClient:   Map-Reduce Framework
12/01/11 16:27:49 INFO mapred.JobClient:     Map input records=46
12/01/11 16:27:49 INFO mapred.JobClient:     Spilled Records=0
12/01/11 16:27:49 INFO mapred.JobClient:     Map output records=45
12/01/11 16:27:49 INFO kmeans.Job: Running random seed to get initial
clusters
12/01/11 16:27:49 INFO util.NativeCodeLoader: Loaded the native-hadoop
library
12/01/11 16:27:49 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
12/01/11 16:27:49 INFO compress.CodecPool: Got brand-new compressor
12/01/11 16:27:49 INFO kmeans.RandomSeedGenerator: Wrote 6 vectors to
output/clusters-0/part-randomSeed
12/01/11 16:27:49 INFO kmeans.Job: Running KMeans
12/01/11 16:27:49 INFO kmeans.KMeansDriver: Input: output/data Clusters In:
output/clusters-0/part-randomSeed Out: output Distance:
org.apache.mahout.common.distance.EuclideanDistanceMeasure
12/01/11 16:27:49 INFO kmeans.KMeansDriver: convergence: 0.5 max
Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable
Input Vectors: {}
12/01/11 16:27:49 INFO kmeans.KMeansDriver: K-Means Iteration 1
12/01/11 16:27:49 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
12/01/11 16:27:50 INFO input.FileInputFormat: Total input paths to process
: 1
12/01/11 16:27:50 INFO mapred.JobClient: Running job: job_201201111447_0014
12/01/11 16:27:51 INFO mapred.JobClient:  map 0% reduce 0%
12/01/11 16:28:02 INFO mapred.JobClient:  map 100% reduce 0%
12/01/11 16:28:16 INFO mapred.JobClient: Task Id :
attempt_201201111447_0014_r_000000_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
12/01/11 16:28:16 WARN mapred.JobClient: Error reading task
outputConnection refused
12/01/11 16:28:16 WARN mapred.JobClient: Error reading task
outputConnection refused
12/01/11 16:28:31 INFO mapred.JobClient: Task Id :
attempt_201201111447_0014_r_000000_1, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
12/01/11 16:28:31 WARN mapred.JobClient: Error reading task
outputConnection refused
12/01/11 16:28:31 WARN mapred.JobClient: Error reading task
outputConnection refused
12/01/11 16:28:46 INFO mapred.JobClient: Task Id :
attempt_201201111447_0014_r_000000_2, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
12/01/11 16:28:46 WARN mapred.JobClient: Error reading task
outputConnection refused
12/01/11 16:28:46 WARN mapred.JobClient: Error reading task
outputConnection refused
12/01/11 16:29:04 INFO mapred.JobClient: Job complete: job_201201111447_0014
12/01/11 16:29:04 INFO mapred.JobClient: Counters: 12
12/01/11 16:29:04 INFO mapred.JobClient:   Job Counters
12/01/11 16:29:04 INFO mapred.JobClient:     Launched reduce tasks=4
12/01/11 16:29:04 INFO mapred.JobClient:     Launched map tasks=1
12/01/11 16:29:04 INFO mapred.JobClient:     Data-local map tasks=1
12/01/11 16:29:04 INFO mapred.JobClient:     Failed reduce tasks=1
12/01/11 16:29:04 INFO mapred.JobClient:   FileSystemCounters
12/01/11 16:29:04 INFO mapred.JobClient:     HDFS_BYTES_READ=42649
12/01/11 16:29:04 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=10033
12/01/11 16:29:04 INFO mapred.JobClient:   Map-Reduce Framework
12/01/11 16:29:04 INFO mapred.JobClient:     Combine output records=6
12/01/11 16:29:04 INFO mapred.JobClient:     Map input records=45
12/01/11 16:29:04 INFO mapred.JobClient:     Spilled Records=6
12/01/11 16:29:04 INFO mapred.JobClient:     Map output bytes=74772
12/01/11 16:29:04 INFO mapred.JobClient:     Combine input records=45
12/01/11 16:29:04 INFO mapred.JobClient:     Map output records=45
Exception in thread "main" java.lang.InterruptedException: K-Means
Iteration failed processing output/clusters-0/part-randomSeed
    at
org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:371)
    at
org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:316)
    at
org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:239)
    at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:154)
    at
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.run(Job.java:146)
    at
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:61)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)






Re: Synthetic control dataset clustering { Doubt Reg Cardinality }

Posted by Suneel Marthi <su...@yahoo.com>.

a) It would help if you could post the exception stacktrace.

b) CardinalityException is thrown when there is a cardinality mismatch when performing vector or matrix operations. For the KMeans driver, this is usually thrown when computing the distance measure between two feature vectors that have differing dimensions.
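
To illustrate the kind of check involved, here is a rough sketch of a Euclidean distance that refuses vectors of different sizes. It is a stand-in written for this thread, not Mahout's implementation: DistanceSizeCheck is a hypothetical class, it throws IllegalArgumentException rather than CardinalityException, and which vector counts as the "required" size is an assumption.

```java
/** Hypothetical stand-in, not Mahout's code: Euclidean distance with a size check. */
class DistanceSizeCheck {

  static double euclideanDistance(double[] point, double[] center) {
    if (point.length != center.length) {
      // Message modeled on the reported error; Mahout throws CardinalityException here.
      throw new IllegalArgumentException(
          "Required cardinality " + center.length + " but got " + point.length);
    }
    double sumOfSquares = 0.0;
    for (int i = 0; i < point.length; i++) {
      double diff = point[i] - center[i];
      sumOfSquares += diff * diff;
    }
    return Math.sqrt(sumOfSquares);
  }

  public static void main(String[] args) {
    // Same size: the distance is computed normally.
    System.out.println(euclideanDistance(new double[] {0.0, 3.0}, new double[] {4.0, 0.0}));
    try {
      // Sizes 91 and 16, as in the reported error: the check fails before any arithmetic.
      euclideanDistance(new double[91], new double[16]);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The cardinality is not passed to the driver as a parameter in this sketch; it is simply the length of each vector, so a mismatch only surfaces when two vectors are compared.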

________________________________
 From: rahul raghavendhra <ra...@gmail.com>
To: user@mahout.apache.org; dev@mahout.apache.org 
Sent: Wednesday, January 11, 2012 4:39 AM
Subject: Synthetic control dataset clustering { Doubt Reg Cardinality }
