Posted to user@mahout.apache.org by Alfred Dimaunahan <al...@fbmsoftware.com> on 2011/03/16 09:59:14 UTC

Fuzzy K-Means Document score

I'm currently studying Mahout and would like to ask for some help
understanding a few concepts.

I read this in Mahout in Action (p. 131, section 9.3.3):
===
The Fuzzy K-Means algorithm gave us a way to refine the related-articles
code. Now we know by what degree a point belongs to a cluster. Using this
information we can find the top clusters the point belongs to and use the
degree to find the weighted score of articles. This way we negate the
strictness of hard clustering and give better related articles for documents
lying on the boundaries of a cluster.
===
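
In other words, once fuzzy k-means reports the degree to which each document
belongs to each cluster, related articles can be scored by how much
membership they share with the query document across its top clusters. Here
is a minimal sketch of that idea (plain Java; the memberships map and the
RelatedArticles class are hypothetical illustrations of whatever structure
you load the fuzzy assignments into, not Mahout API):
===
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Mahout API: memberships maps
// docId -> (clusterId -> membership degree) as produced by fuzzy k-means.
public class RelatedArticles {

    static Map<String, Double> score(Map<String, Map<Integer, Double>> memberships,
                                     String queryDoc) {
        Map<Integer, Double> queryDegrees = memberships.get(queryDoc);
        Map<String, Double> scores = new HashMap<String, Double>();
        for (Map.Entry<String, Map<Integer, Double>> doc : memberships.entrySet()) {
            if (doc.getKey().equals(queryDoc)) {
                continue; // don't recommend the query article to itself
            }
            double s = 0.0;
            // weight each shared cluster by both documents' membership degrees,
            // so boundary documents still pick up related articles from every
            // cluster they partially belong to
            for (Map.Entry<Integer, Double> q : queryDegrees.entrySet()) {
                Double d = doc.getValue().get(q.getKey());
                if (d != null) {
                    s += q.getValue() * d;
                }
            }
            scores.put(doc.getKey(), s);
        }
        return scores; // sort descending to get the top related articles
    }
}
===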

I've seen the output of Fuzzy K-Means: ClusterDumper can show the number of
clusters and the top terms associated with some numerical value.

I would like to know what the next step is after that in order to identify
document N's score for cluster 0, 1, 2, etc. Do I have to refer to a
sequence file (not sure which one to read, probably
tokenized-documents/part-m-00000) and then do some calculation to derive the
relevancy of an article/document based on the term scores ClusterDumper
provides per cluster?

Or is there a sequence file that already has the information on article ID
and score per topic/cluster?

Thanks for any leads on this.

-Alfred

Re: Fuzzy K-Means Document score

Posted by Alfred Dimaunahan <al...@fbmsoftware.com>.
Hi Robin,

Thanks for that info, and you're right, it is in clusteredPoints. Got the
idea from looking into ClusterDumper and TestClusterDumper.

I have another question though: I need to understand why the output of one
approach (TestFuzzyKmeans1) is better than the other (TestFuzzyKmeans2)
(please see attachments).

Using the same sample data (in TestFuzzyKmeans2 the 15 text files are
created and converted into a sequence file), the output differs in terms of
weights and cluster assignment. I know the second approach is the best
practice, but is there a way to port the logic of TestFuzzyKmeans1 to
something like TestFuzzyKmeans2? Or is TestFuzzyKmeans1 indeed a better and
simpler approach? I tried playing with the parameters in TestFuzzyKmeans2,
but the output is still inconsistent, unlike the result of TestFuzzyKmeans1.
My issue with TestFuzzyKmeans1 is its scalability (since it works in memory
instead of on files).

Note that the code is based on TestClusterDumper.java in Mahout's source and
NewsFuzzyKMeansClustering.java from the Mahout in Action book.

Hoping for some pointers on how to improve the parameters, and for some
discussion of why one approach is better. Thanks!

-Alfred

P.S.

I'm pasting the code here in case the file attachments don't come through:
===
public class TestFuzzyKmeans1 {

    public static void main(String args[]) throws Exception {

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Create test data
        getSampleData(DOCS);

        writePointsToFile(sampleData, true, new Path("testdata/file1"), fs, conf);

        DistanceMeasure measure = new EuclideanDistanceMeasure();

        // now run the Canopy job to prime kMeans canopies
        Path output = new Path("output");
        HadoopUtil.overwriteOutput(output);
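        // t1 = 8 and t2 = 4 are the canopy distance thresholds; the two
        // booleans mean "don't run the clustering pass" and "run as
        // MapReduce rather than sequentially" (assuming the Mahout
        // 0.4-era CanopyDriver.run signature)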
        CanopyDriver.run(conf, new Path("testdata"), output, measure, 8, 4, false, false);

        // now run the Fuzzy KMeans job
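        // arguments after the measure (assuming the Mahout 0.4-era signature):
        // convergence delta, max iterations, fuzziness m, runClustering,
        // emitMostLikely, clustering threshold, runSequential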
        FuzzyKMeansDriver.run(conf,
            new Path("testdata"),
            new Path(output, "clusters-0"),
            output,
            measure,
            0.001,
            10,
            ((float) 1.1),
            true,
            true,
            0,
            false);

        // run ClusterDumper
        ClusterDumper clusterDumper = new ClusterDumper(finalClusterPath(conf, output, 10),
                new Path(output, "clusteredPoints"));
        clusterDumper.printClusters(termDictionary);
    }

    private static void getSampleData(String[] docs2) throws IOException {
        sampleData = new ArrayList<VectorWritable>();
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory,
                new StandardAnalyzer(Version.LUCENE_30),
                true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < docs2.length; i++) {
            Document doc = new Document();
            Fieldable id = new Field("id", "doc_" + i, Field.Store.YES,
                    Field.Index.NOT_ANALYZED_NO_NORMS);
            doc.add(id);
            // Store both position and offset information
            Fieldable text = new Field("content", docs2[i], Field.Store.NO,
                    Field.Index.ANALYZED, Field.TermVector.YES);
            doc.add(text);
            writer.addDocument(doc);
        }
        writer.close();
        IndexReader reader = IndexReader.open(directory, true);
        Weight weight = new TFIDF();
        TermInfo termInfo = new CachedTermInfo(reader, "content", 1, 100);

        int numTerms = 0;
        for (Iterator<TermEntry> it = termInfo.getAllEntries(); it.hasNext();) {
            it.next();
            numTerms++;
        }
        termDictionary = new String[numTerms];
        int i = 0;
        for (Iterator<TermEntry> it = termInfo.getAllEntries(); it.hasNext();) {
            String term = it.next().term;
            termDictionary[i] = term;
            System.out.println(i + " " + term);
            i++;
        }
        VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
        Iterable<Vector> iterable = new LuceneIterable(reader, "id", "content", mapper);

        i = 0;
        for (Vector vector : iterable) {
            NamedVector namedVector;
            if (vector instanceof NamedVector) {
                //rename it for testing purposes
                namedVector = new NamedVector(((NamedVector) vector).getDelegate(), "P(" + i + ')');

            } else {
                namedVector = new NamedVector(vector, "P(" + i + ')');
            }
            System.out.println(AbstractCluster.formatVector(namedVector, termDictionary));
            sampleData.add(new VectorWritable(namedVector));
            i++;
        }
    }

    public static void writePointsToFile(Iterable<VectorWritable> points,
        boolean intWritable,
        Path path,
        FileSystem fs,
        Configuration conf) throws IOException {

        SequenceFile.Writer writer = new SequenceFile.Writer(fs,
                conf,
                path,
                intWritable ? IntWritable.class : LongWritable.class,
                VectorWritable.class);
        int recNum = 0;
        for (VectorWritable point : points) {
            writer.append(intWritable ? new IntWritable(recNum++) : new LongWritable(recNum++), point);
        }
        writer.close();
    }

    private static Path finalClusterPath(Configuration conf, Path output, int maxIterations) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        for (int i = maxIterations; i >= 0; i--) {
            Path clusters = new Path(output, "clusters-" + i);
            if (fs.exists(clusters)) {
                return clusters;
            }
        }
        return null;
    }

    private static final String[] DOCS = {
            "The quick red fox jumped over the lazy brown dogs.",
            "The quick red cat jumped over the lazy brown dogs.",
            "The quick brown cat jumped over the lazy red dogs.",
            "Mary had a little lamb whose fleece was white as snow.",
            "Mary had a little lamb whose fleece was black as tar.",
            "Dick had a little goat whose fleece was white as snow.",
            "Moby Dick is a story of a whale and a man obsessed.",
            "Moby Bob is a story of a walrus and a man obsessed.",
            "Moby Dick is a story of a whale and a crazy man.",
            "The robber wore a black fleece jacket and a baseball cap.",
            "The robber wore a red fleece jacket and a baseball cap.",
            "The robber wore a white fleece jacket and a baseball cap.",
            "The quick brown fox jumped over the lazy red dogs.",
            "Mary had a little goat whose fleece was white as snow.",
            "The English Springer Spaniel is the best of all dogs." };

    private static List<VectorWritable> sampleData;

    private static String[] termDictionary;
}
===

===
public class TestFuzzyKmeans2 {

    public static void main(String args[]) throws Exception {

        int minSupport = 2;
        int minDf = 1;
        int maxDFPercent = 70;
        int maxNGramSize = 1;
        int minLLRValue = 1;
        int reduceTasks = 1;
        int chunkSize = 100;
        int norm = 2;
        int numberOfClusters = 5;
        boolean sequentialAccessOutput = false;

        String inputDir = "inputDir"; // a sequence file
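        // the input sequence file is expected to hold (Text docId, Text content)
        // pairs, e.g. as produced by Mahout's seqdirectory tool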

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        String outputDir = "newsClusters";
        HadoopUtil.overwriteOutput(new Path(outputDir));

        Path tokenizedPath = new Path(outputDir,
                DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
        MyAnalyzer analyzer = new MyAnalyzer();
//        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        DocumentProcessor.tokenizeDocuments(new Path(inputDir),
                analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath);

        DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
                new Path(outputDir), conf, minSupport, maxNGramSize, minLLRValue,
                norm, true, reduceTasks, chunkSize, sequentialAccessOutput, true);

        TFIDFConverter.processTfIdf(
                new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
                new Path(outputDir), chunkSize, minDf, maxDFPercent, norm,
                true, sequentialAccessOutput, false, reduceTasks);
        // point at the TF-IDF vectors produced above; "/tf-vectors" would
        // silently skip the IDF weighting computed by processTfIdf
        String vectorsFolder = outputDir + "/tfidf-vectors";
        String canopyCentroids = outputDir + "/canopy-centroids";
        String clusterOutput = outputDir + "/clusters/";

        DistanceMeasure measure = new EuclideanDistanceMeasure();
//        DistanceMeasure measure = new TanimotoDistanceMeasure();

        // using RandomSeedGenerator
//        Path canopyCentroidsPath = RandomSeedGenerator.buildRandom(
//                new Path(vectorsFolder), new Path(canopyCentroids), numberOfClusters, measure);
//        FuzzyKMeansDriver.run(conf,
//            new Path(vectorsFolder),
//            canopyCentroidsPath,
//            new Path(clusterOutput),
//            measure,
//            0.001,
//            10,
//            ((float) 1.1),
//            true,
//            true,
//            0.0,
//            false);

        // using CanopyDriver
        CanopyDriver.run(conf,
            new Path(vectorsFolder),
            new Path(canopyCentroids),
            measure,
            8,
            4,
            false,
            false);

        FuzzyKMeansDriver.run(conf,
            new Path(vectorsFolder),
            new Path(canopyCentroids, "clusters-0"),
            new Path(clusterOutput),
            measure,
            0.001,
            10,
            ((float) 1.1),
            true,
            true,
            0.0,
            false);

        // run ClusterDumper
        // search down from the configured maximum of 10 iterations for the
        // last clusters-* folder actually written
        ClusterDumper clusterDumper = new ClusterDumper(finalClusterPath(conf, new Path(clusterOutput), 10),
                new Path(clusterOutput, "clusteredPoints"));
        clusterDumper.setTermDictionary(outputDir + "/dictionary.file-0", "sequencefile");
        clusterDumper.printClusters(null);

    }

    private static Path finalClusterPath(Configuration conf, Path output, int maxIterations) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        for (int i = maxIterations; i >= 0; i--) {
            Path clusters = new Path(output, "clusters-" + i);
            if (fs.exists(clusters)) {
                return clusters;
            }
        }
        return null;
    }
}
===

On Thu, Mar 17, 2011 at 12:00 AM, Robin Anil <ro...@gmail.com> wrote:

> > I've seen the output of Fuzzy K-Means: ClusterDumper can show the number
> > of clusters and the top terms associated with some numerical value.
> >
> > I would like to know what the next step is after that in order to
> > identify document N's score for cluster 0, 1, 2, etc. Do I have to refer
> > to a sequence file (not sure which one to read, probably
> > tokenized-documents/part-m-00000) and then do some calculation to derive
> > the relevancy of an article/document based on the term scores
> > ClusterDumper provides per cluster?
>
> Clustering algorithms have a flag to write an output file with the
> assignment between a document ID and the cluster ID. Enabling it will
> generate a folder in the output named clusteredPoints; see the --help
> options to find out what this flag is. You just need to dump that sequence
> file to see what's in it.
>

Re: Fuzzy K-Means Document score

Posted by Robin Anil <ro...@gmail.com>.
> I've seen the output of Fuzzy K-Means: ClusterDumper can show the number
> of clusters and the top terms associated with some numerical value.
>
> I would like to know what the next step is after that in order to identify
> document N's score for cluster 0, 1, 2, etc. Do I have to refer to a
> sequence file (not sure which one to read, probably
> tokenized-documents/part-m-00000) and then do some calculation to derive
> the relevancy of an article/document based on the term scores
> ClusterDumper provides per cluster?

Clustering algorithms have a flag to write an output file with the
assignment between a document ID and the cluster ID. Enabling it will
generate a folder in the output named clusteredPoints; see the --help
options to find out what this flag is. You just need to dump that sequence
file to see what's in it.
