Posted to user@mahout.apache.org by Sascha Nordquist <sa...@nordquist.de> on 2010/11/26 18:53:05 UTC
Dirichlet Clustering failed
Hi,
when I run Dirichlet Clustering I get the following Exception:
org.apache.mahout.math.CardinalityException: Required cardinality 10672 but got 10
        at org.apache.mahout.math.RandomAccessSparseVector.dot(RandomAccessSparseVector.java:172)
        at org.apache.mahout.math.NamedVector.dot(NamedVector.java:127)
        at org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:130)
        at org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:38)
        at org.apache.mahout.clustering.dirichlet.DirichletState.adjustedProbability(DirichletState.java:129)
        at org.apache.mahout.clustering.dirichlet.DirichletClusterer.assignToModel(DirichletClusterer.java:256)
        at org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:47)
        at org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:41)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
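Reading the trace: NormalModel.pdf takes a dot product between the model's prototype vector and each input vector, and Mahout's vectors refuse dot products between vectors of different cardinality. The following is a minimal plain-Java stand-in that reproduces the error; the SparseVector and CardinalityException classes here are simplified sketches for illustration, not Mahout's actual implementations.

```java
// Stand-in for org.apache.mahout.math.CardinalityException
class CardinalityException extends RuntimeException {
    CardinalityException(int required, int got) {
        super("Required cardinality " + required + " but got " + got);
    }
}

// Simplified dense stand-in for a Mahout vector
class SparseVector {
    final double[] values;
    SparseVector(int cardinality) { values = new double[cardinality]; }
    int size() { return values.length; }

    // Mirrors the cardinality check that fires in RandomAccessSparseVector.dot()
    double dot(SparseVector other) {
        if (size() != other.size()) {
            throw new CardinalityException(size(), other.size());
        }
        double sum = 0.0;
        for (int i = 0; i < size(); i++) {
            sum += values[i] * other.values[i];
        }
        return sum;
    }
}

public class CardinalityDemo {
    public static void main(String[] args) {
        SparseVector modelMean = new SparseVector(10);    // prototypeSize = 10
        SparseVector tfVector = new SparseVector(10672);  // dictionary size
        try {
            tfVector.dot(modelMean);
        } catch (CardinalityException e) {
            System.out.println(e.getMessage());
            // prints "Required cardinality 10672 but got 10"
        }
    }
}
```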
This is the method for Dirichlet clustering:
private void dirichletClustering(Path vectorPath, DistanceMeasure measure,
        int numClusters, int maxIterations, double alpha0) throws Exception {
    boolean runSequential = false;
    Configuration conf = new Configuration();
    int prototypeSize = 10;
    boolean emitMostLikely = true;
    double threshold = 0.1;
    String modelPrototype = "org.apache.mahout.math.RandomAccessSparseVector";
    String modelFactory = "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution";
    Path clusterPath = new Path(outputDir, "dirichletClustering-c" + numClusters
        + "-alpha" + alpha0 + "-" + measure.getClass().getSimpleName());
    HadoopUtil.overwriteOutput(clusterPath);
    Path clusterPointsPath = new Path(clusterPath, AbstractCluster.CLUSTERED_POINTS_DIR);
    AbstractVectorModelDistribution modelDistribution =
        DirichletDriver.createModelDistribution(modelFactory, modelPrototype,
            measure.getClass().getName(), prototypeSize);
    Path resultPath = DirichletDriver.buildClusters(conf, vectorPath, clusterPath,
        modelDistribution, numClusters, maxIterations, alpha0, runSequential);
    DirichletDriver.clusterData(conf, vectorPath, clusterPath, clusterPointsPath,
        emitMostLikely, threshold, runSequential);
}
The vectors are created this way:
private void generateVectors() throws Exception {
    int minSupport = 2;
    int maxNGramSize = 2;
    float minLLRValue = 50;
    float normPower = 2;
    boolean logNormalize = false;
    int chunkSizeInMegabytes = 64;
    int numReducers = 1;
    boolean sequentialAccessOutput = false;
    boolean namedVectors = true;

    Configuration conf = new Configuration();
    String tokenizedDir = preparePath.toString() + "/"
        + DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER;
    Path tokenizedPath = new Path(tokenizedDir);
    HadoopUtil.overwriteOutput(preparePath);
    DocumentProcessor.tokenizeDocuments(inputPath, DefaultAnalyzer.class, tokenizedPath);
    DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, preparePath, conf,
        minSupport, maxNGramSize, minLLRValue, normPower, logNormalize,
        numReducers, chunkSizeInMegabytes, sequentialAccessOutput, namedVectors);
}
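For context on where the 10672 comes from: DictionaryVectorizer assigns one dimension per dictionary entry, so the cardinality of every tf vector equals the size of the dictionary it builds. The sketch below is a simplified stand-in for that dictionary-building step (it counts total term occurrences against minSupport and ignores the n-gram features that maxNGramSize > 1 would add, and the real step runs as MapReduce jobs); it only illustrates that the vector cardinality is the dictionary size.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionarySizeDemo {

    // Simplified stand-in: every distinct term occurring at least
    // minSupport times in the corpus gets one vector dimension.
    static int dictionarySize(List<String[]> tokenizedDocs, int minSupport) {
        Map<String, Integer> termFreq = new HashMap<>();
        for (String[] doc : tokenizedDocs) {
            for (String term : doc) {
                termFreq.merge(term, 1, Integer::sum);
            }
        }
        int size = 0;
        for (int count : termFreq.values()) {
            if (count >= minSupport) {
                size++;
            }
        }
        return size;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
            new String[]{"mahout", "clustering"},
            new String[]{"mahout", "vectors"},
            new String[]{"clustering", "vectors"});
        // Each term occurs twice, so with minSupport = 2 all three survive
        System.out.println(dictionarySize(docs, 2)); // prints 3
    }
}
```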
I have already used these tf vectors as input for k-means and fuzzy k-means, so
what's wrong?
Thanks!
Re: Dirichlet Clustering failed
Posted by Federico Castanedo <fc...@inf.uc3m.es>.
Hi Sascha,
What is the size of your input vectors? 10672?
I think you need to use a prototype size of 10672 instead of 10.
HTH