Posted to dev@mahout.apache.org by eric skinner <er...@gmail.com> on 2011/08/10 17:32:36 UTC

issues on Mahout clustering result using K-means

I ran the K-means clustering algorithm against a set of sequence files.
However, the generated result looks like this:

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

0 belongs to cluster 1.0: []

Could you tell me why I get this type of result? Is it caused by a specific
parameter setting, or by something else?

The program I use is borrowed from NewsKMeansClustering.java, an example
given in chapter 9 of Mahout in Action.

The core clustering code in this program is:

CanopyDriver.run(vectorsFolder, canopyCentroids,
    new EuclideanDistanceMeasure(), 250, 120, false, false);

KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0"),
    clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, false);

RE: issues on Mahout clustering result using K-means

Posted by Jeff Eastman <je...@Narus.com>.

It looks to me like all of your data points are sparse and empty. Check your input vectors for nonzero values :)
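
One way to act on this suggestion is to count the nonzero entries of each vector before clustering. The sketch below is plain Java and uses a Map<Integer, Double> as a hypothetical stand-in for a sparse vector; it is not Mahout's Vector class, and the class and method names are illustrative only. With real data the vectors would be read from the tfidf-vectors directory instead.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in: a sparse vector is modeled as index -> value.
// This is NOT Mahout's Vector class; all names here are hypothetical.
final class EmptyVectorCheck {

  // Counts the entries whose value is not zero.
  static int nonZeroCount(Map<Integer, Double> vector) {
    int count = 0;
    for (double value : vector.values()) {
      if (value != 0.0) {
        count++;
      }
    }
    return count;
  }

  // Returns how many vectors in the list carry no nonzero value at all.
  static int countEmpty(List<Map<Integer, Double>> vectors) {
    int empty = 0;
    for (Map<Integer, Double> vector : vectors) {
      if (nonZeroCount(vector) == 0) {
        empty++;
      }
    }
    return empty;
  }

  public static void main(String[] args) {
    Map<Integer, Double> populated = new LinkedHashMap<>();
    populated.put(3, 0.7);
    populated.put(9, 1.2);
    Map<Integer, Double> empty = new LinkedHashMap<>();

    List<Map<Integer, Double>> vectors = Arrays.asList(populated, empty);
    // prints "1 of 2 vectors are empty"
    System.out.println(countEmpty(vectors) + " of " + vectors.size()
        + " vectors are empty");
  }
}
```

If every vector comes back empty, as the output in this thread suggests, the problem most likely sits in the vectorization step rather than in the clustering parameters.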

Re: issues on Mahout clustering result using K-means

Posted by surf reta <su...@gmail.com>.

Hi Jeff,

With respect to the clusterdump result for the K-means-generated clusters, I
get something like

VL-0{n=100 c=[] r=[]}
        Weight:  Point:
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []

With respect to the clusterdump result for canopyCentroids/clusters-0, I get
something like

C-0{n=1 c=[] r=[]}
        Weight:  Point:
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []
        1.0: []

I am really confused about the meaning of these results.

Thanks.
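
In a header line such as VL-0{n=100 c=[] r=[]}, n appears to be the number of points counted for the cluster, c its center and r its radius, so c=[] and r=[] would mean that the center and radius have no nonzero entries. That reading is an assumption based on the output shown here, not on a format specification. The plain-Java sketch below splits such a line into its fields; the ClusterHeader class is hypothetical and not part of Mahout.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading a clusterdump header line such as
// "VL-0{n=100 c=[] r=[]}". The field meanings noted below are an
// assumption based on the output shown in this thread.
final class ClusterHeader {
  private static final Pattern HEADER =
      Pattern.compile("^(\\S+?)\\{n=(\\d+) c=(\\[.*?\\]) r=(\\[.*?\\])\\}$");

  final String id;      // cluster identifier, e.g. "VL-0" or "C-0"
  final int numPoints;  // n: number of points counted for the cluster
  final String center;  // c: the printed center vector
  final String radius;  // r: the printed radius vector

  private ClusterHeader(String id, int numPoints, String center, String radius) {
    this.id = id;
    this.numPoints = numPoints;
    this.center = center;
    this.radius = radius;
  }

  // Parses one header line, or throws if the line has a different shape.
  static ClusterHeader parse(String line) {
    Matcher m = HEADER.matcher(line.trim());
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a cluster header: " + line);
    }
    return new ClusterHeader(m.group(1), Integer.parseInt(m.group(2)),
        m.group(3), m.group(4));
  }

  // True when the center was printed without any entries.
  boolean hasEmptyCenter() {
    return "[]".equals(center);
  }

  public static void main(String[] args) {
    String[] lines = {"VL-0{n=100 c=[] r=[]}", "C-0{n=1 c=[] r=[]}"};
    for (String line : lines) {
      ClusterHeader h = parse(line);
      System.out.println(h.id + ": points=" + h.numPoints
          + ", empty center=" + h.hasEmptyCenter());
    }
  }
}
```

Under that reading, both dumps above describe clusters whose centers are empty, and each "1.0: []" line is a point with weight 1.0 and no nonzero values.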

RE: issues on Mahout clustering result using K-means

Posted by Jeff Eastman <je...@Narus.com>.

Run clusterdump -s canopyCentroids/clusters-0. Generally, Mahout arguments are directories full of part-n files. You can also run clusterdump -s clusterOutput/clusters-n -p .../clusteredPoints after KMeans to see the results of your clustering. Argument 'n' would be the last iteration number.
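
Since 'n' is the last iteration number, a small helper can pick the highest-numbered clusters-n entry from a directory listing. The sketch below is plain Java and works on a list of names; the LastIteration class is hypothetical, and it assumes the directories are named exactly clusters-n as in this thread. On HDFS the names would come from the file system listing instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the last K-means iteration directory by name.
final class LastIteration {
  private static final Pattern CLUSTERS_DIR = Pattern.compile("clusters-(\\d+)");

  // Returns the highest n among names of the form "clusters-n", or -1 if none.
  static int lastIteration(List<String> names) {
    int last = -1;
    for (String name : names) {
      Matcher m = CLUSTERS_DIR.matcher(name);
      if (m.matches()) {
        // Compare numerically so that clusters-10 ranks above clusters-2.
        last = Math.max(last, Integer.parseInt(m.group(1)));
      }
    }
    return last;
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList(
        "clusteredPoints", "clusters-1", "clusters-10", "clusters-2");
    // prints "clusters-10"
    System.out.println("clusters-" + lastIteration(names));
  }
}
```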

Re: issues on Mahout clustering result using K-means

Posted by surf reta <su...@gmail.com>.

Hi Jeff,

I first converted a set of text files into sequence files with the custom
program below, which uses Mahout's SequenceFilesFromDirectory utility:

public class TestSequenceFileConverter {

    public static void main(String[] args) {

        String inputDir = "testdataset";
        String outputDir = "sequenceInputDir";
        try {
            // Convert every text file under inputDir into sequence files.
            SequenceFilesFromDirectory.main(new String[] {
                "--input", inputDir, "--output", outputDir,
                "--chunkSize", "64", "--charset", Charsets.UTF_8.name()});
        } catch (Exception e) {
            // Report the failure instead of printing an empty line.
            e.printStackTrace();
        }
    }

}


Then I ran the K-means program, borrowed from NewsKMeansClustering, an
example program given in Mahout in Action, against these generated sequence
files.

I just checked the generated clusters-0 directory; it has a file called
part-r-00000. How can I read this file and get useful information from it?
Thanks.

NewsKMeansClustering is listed here for your reference:

public class NewsKMeansClustering {

  public static void main(String args[]) throws Exception {

    int minSupport = 5;
    int minDf = 5;
    int maxDFPercent = 95;
    int maxNGramSize = 2;
    int minLLRValue = 50;
    int reduceTasks = 1;
    int chunkSize = 200;
    int norm = 2;
    boolean sequentialAccessOutput = true;

  //  String inputDir = "inputDir";

    String inputDir = "sequenceInputDir";

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    /*
     * SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
     *     new Path(inputDir, "documents.seq"), Text.class, Text.class);
     * for (Document d : Database) {
     *   writer.append(new Text(d.getID()), new Text(d.contents()));
     * }
     * writer.close();
     */

    String outputDir = "newsClusters";
    HadoopUtil.delete(conf, new Path(outputDir));
    Path tokenizedPath = new Path(outputDir,
        DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
    MyAnalyzer analyzer = new MyAnalyzer();
    DocumentProcessor.tokenizeDocuments(new Path(inputDir),
        analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);

    DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
        new Path(outputDir), conf, minSupport, maxNGramSize, minLLRValue, 2,
        true, reduceTasks, chunkSize, sequentialAccessOutput, false);
    TFIDFConverter.processTfIdf(
        new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
        new Path(outputDir), conf, chunkSize, minDf,
        maxDFPercent, norm, true, sequentialAccessOutput, false, reduceTasks);
    Path vectorsFolder = new Path(outputDir, "tfidf-vectors");
    Path canopyCentroids = new Path(outputDir, "canopy-centroids");
    Path clusterOutput = new Path(outputDir, "clusters");

    CanopyDriver.run(vectorsFolder, canopyCentroids,
        new EuclideanDistanceMeasure(), 250, 120, false, false);
    KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0"),
        clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, false);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs,
        new Path(clusterOutput + "/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"),
        conf);
    // new Path(clusterOutput+"/clusteredPoints"+"/part-m-00000"),conf);

    IntWritable key = new IntWritable();
    WeightedVectorWritable value = new WeightedVectorWritable();
    while (reader.next(key, value)) {
       System.out.println(key.toString() + " belongs to cluster "
       + value.toString());
    }
    reader.close();
  }
}
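
The listing sets minSupport = 5, which appears to be the minimum number of times a term must occur in the whole corpus to be kept in the dictionary. On a small test corpus a threshold like this could plausibly drop every term and leave empty vectors of the kind reported in this thread. That is only one possible explanation, not a confirmed diagnosis. The sketch below is plain Java and not Mahout's implementation; the class name, the toy documents and the pruning rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a simplified corpus-frequency filter, not Mahout code.
final class MinSupportSketch {

  // Counts how often each term occurs across all documents.
  static Map<String, Integer> corpusFrequency(List<List<String>> docs) {
    Map<String, Integer> counts = new HashMap<>();
    for (List<String> doc : docs) {
      for (String term : doc) {
        counts.merge(term, 1, Integer::sum);
      }
    }
    return counts;
  }

  // Keeps, per document, only the terms seen at least minSupport times overall.
  static List<List<String>> prune(List<List<String>> docs, int minSupport) {
    Map<String, Integer> counts = corpusFrequency(docs);
    List<List<String>> pruned = new ArrayList<>();
    for (List<String> doc : docs) {
      List<String> kept = new ArrayList<>();
      for (String term : doc) {
        if (counts.get(term) >= minSupport) {
          kept.add(term);
        }
      }
      pruned.add(kept);
    }
    return pruned;
  }

  public static void main(String[] args) {
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("mahout", "kmeans"),
        Arrays.asList("mahout", "canopy"),
        Arrays.asList("hadoop", "kmeans"));

    // prints "[[], [], []]": no term occurs five times in this tiny corpus
    System.out.println(prune(docs, 5));
    // prints "[[mahout, kmeans], [mahout], [kmeans]]"
    System.out.println(prune(docs, 2));
  }
}
```

If the test dataset is small, lowering minSupport for a trial run and inspecting the resulting vectors would show whether this threshold is involved.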



RE: issues on Mahout clustering result using K-means

Posted by Jeff Eastman <je...@Narus.com>.

What do your input vectors look like?
How many canopies did you get in clusters-0?
