Posted to user@mahout.apache.org by Sascha Nordquist <sa...@nordquist.de> on 2010/11/26 18:53:05 UTC

Dirichlet Clustering failed

Hi,

When I run Dirichlet clustering, I get the following exception:

org.apache.mahout.math.CardinalityException: Required cardinality 10672 but got 10
     at org.apache.mahout.math.RandomAccessSparseVector.dot(RandomAccessSparseVector.java:172)
     at org.apache.mahout.math.NamedVector.dot(NamedVector.java:127)
     at org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:130)
     at org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:38)
     at org.apache.mahout.clustering.dirichlet.DirichletState.adjustedProbability(DirichletState.java:129)
     at org.apache.mahout.clustering.dirichlet.DirichletClusterer.assignToModel(DirichletClusterer.java:256)
     at org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:47)
     at org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:41)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)


This is the method for Dirichlet clustering:

     private void dirichletClustering(Path vectorPath, DistanceMeasure measure,
             int numClusters, int maxIterations, double alpha0) throws Exception {
         boolean runSequential = false;
         Configuration conf = new Configuration();
         int prototypeSize = 10;
         boolean emitMostLikely = true;
         double threshold = 0.1;
         String modelPrototype = "org.apache.mahout.math.RandomAccessSparseVector";
         String modelFactory =
                 "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution";
         Path clusterPath = new Path(outputDir,
                 "dirichletClustering-c" + numClusters + "-alpha" + alpha0 + "-"
                         + measure.getClass().getSimpleName());
         HadoopUtil.overwriteOutput(clusterPath);
         Path clusterPointsPath = new Path(clusterPath, AbstractCluster.CLUSTERED_POINTS_DIR);
         AbstractVectorModelDistribution modelDistribution =
                 DirichletDriver.createModelDistribution(modelFactory, modelPrototype,
                         measure.getClass().getName(), prototypeSize);
         Path resultPath = DirichletDriver.buildClusters(conf, vectorPath, clusterPath,
                 modelDistribution, numClusters, maxIterations, alpha0, runSequential);
         DirichletDriver.clusterData(conf, vectorPath, clusterPath, clusterPointsPath,
                 emitMostLikely, threshold, runSequential);
     }

The vectors are created this way:

     private void generateVectors() throws Exception {
         int minSupport = 2;
         int maxNGramSize = 2;
         float minLLRValue = 50;
         float normPower = 2;
         boolean logNormalize = false;
         int chunkSizeInMegabytes = 64;
         int numReducers = 1;
         boolean sequentialAccessOutput = false;
         boolean namedVectors = true;

         Configuration conf = new Configuration();
         String tokenizedDir = preparePath.toString() + "/"
                 + DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER;
         Path tokenizedPath = new Path(tokenizedDir);
         HadoopUtil.overwriteOutput(preparePath);
         DocumentProcessor.tokenizeDocuments(inputPath, DefaultAnalyzer.class, tokenizedPath);

         DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
                 preparePath, conf, minSupport, maxNGramSize, minLLRValue,
                 normPower, logNormalize, numReducers, chunkSizeInMegabytes,
                 sequentialAccessOutput, namedVectors);
     }


I have already used these tf vectors as input for k-means and fuzzy k-means
without problems, so what's wrong?

Thanks!

Re: Dirichlet Clustering failed

Posted by Federico Castanedo <fc...@inf.uc3m.es>.
Hi Sascha,

What is the size of your input vectors? 10672?
I think you need to use a prototype size of 10672 instead of 10, so that the
model vectors have the same cardinality as your input vectors.
HTH
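
To illustrate the suggestion above, here is a minimal plain-Java sketch of why the sizes have to agree. It is a stand-in, not Mahout code: the class PrototypeSizeSketch and its methods dot and prototypeSizeFor are hypothetical names, and plain double[] arrays take the place of Mahout vectors. The idea is to take the prototype size from the data rather than hard-coding it; with real Mahout vectors the corresponding value should be what Vector.size() reports for an input vector.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in for the cardinality check; this is not Mahout code. */
class PrototypeSizeSketch {

    /** Dot product that rejects operands of different length, similar in spirit to the failing call. */
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                    "Required cardinality " + a.length + " but got " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Derives the prototype size from the input vectors instead of hard-coding it. */
    static int prototypeSizeFor(List<double[]> inputVectors) {
        if (inputVectors.isEmpty()) {
            throw new IllegalArgumentException("No input vectors");
        }
        int size = inputVectors.get(0).length;
        for (double[] v : inputVectors) {
            if (v.length != size) {
                throw new IllegalArgumentException("Input vectors differ in size");
            }
        }
        return size;
    }

    public static void main(String[] args) {
        // Two all-zero input vectors with the cardinality from the exception message.
        List<double[]> input = Arrays.asList(new double[10672], new double[10672]);

        // A prototype of size 10 cannot be combined with the input vectors.
        double[] badPrototype = new double[10];
        try {
            dot(input.get(0), badPrototype);
        } catch (IllegalArgumentException e) {
            System.out.println("Mismatch: " + e.getMessage());
        }

        // Sizing the prototype from the data avoids the mismatch.
        double[] goodPrototype = new double[prototypeSizeFor(input)];
        System.out.println("Dot with matching prototype: " + dot(input.get(0), goodPrototype));
    }
}
```

In the method from the original post, this would correspond to setting prototypeSize to the cardinality of the tf vectors before calling DirichletDriver.createModelDistribution.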
