Posted to user@mahout.apache.org by Rob Podolski <ro...@yahoo.co.uk> on 2011/11/09 12:17:42 UTC

NewsKMeansClustering - the result most people want seems to be missing

Hi

I managed to get the Manning Chap 09 example NewsKMeansClustering working with my own documents. However, I thought the main point of this was to cluster the news articles together to get groups of similar content.


The example gives you cluster membership in terms of WeightedVectorWritables. But most of us want to know which actual news articles are in each cluster, not which numeric vectors are (though the vectors are useful, albeit indirectly, for getting the most significant terms).


It seems to me that the only way of achieving this most useful result would be to use NamedVectors from the very outset and assign a document identifier as the name of each. Then presumably these would survive the pipeline through the various calls like


DictionaryVectorizer.createTermFrequencyVectors;
TFIDFConverter.processTfIdf;
etc

However, I have not seen a way of doing this.  Anyone got any ideas?


The other thing I explored was whether there was a way of correlating the output WeightedVectorWritables with the original documents. However, WeightedVectorWritable does not even have an equals() method to allow it (though that would be a bad solution anyhow).

I'm new to Mahout and have to admit I've been struggling even to get this far.  Any help would be gratefully received.


R

Re: NewsKMeansClustering - the result most people want seems to be missing

Posted by Rob Podolski <ro...@yahoo.co.uk>.
Many thanks.  Actually I delved into the source code and found out that if you set the (undocumented) namedVector boolean to true in...

        DictionaryVectorizer.createTermFrequencyVectors(
            tokenizedPath,
            new Path(OUTPUT_HFS_FOLDER),
            conf,
            minFrequencyToSupport, // minimum frequency to allow (1 mostly)
            maxNGramSize, // maximum size of n-gram to allow
            minLLRValue, // minimum log likelihood ratio
            -1f, // normPower (-1 should mean no normalization)
            true, // logNormalize
            reduceTasks,
            chunkSize,
            sequentialAccessOutput,
            true); // namedVectors: modified so that named vectors are used; the document id is apparently used as the name


and...

        TFIDFConverter.processTfIdf(
          new Path(OUTPUT_HFS_FOLDER, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
          new Path(OUTPUT_HFS_FOLDER),
          conf,
          chunkSize,
          minDf,
          maxDFPercent,
          2, // normPower
          true, // logNormalize
          sequentialAccessOutput,
          true, // namedVector: modified so named vectors are used
          reduceTasks);

then the code uses NamedVectors with your document ids as the names. When printing the output at the end, you can cast each vector to NamedVector and retrieve the name (the document id). Hence you can list the document ids against the IntWritable cluster numbers.
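As a rough sketch of that last step (not the book's code): the classes Vector, DenseVector, NamedVector and ClusteredPoint below are simplified, hypothetical stand-ins so that the snippet runs without Mahout or Hadoop on the classpath. In the real job you would presumably iterate the clustered points output with a SequenceFile.Reader, take the IntWritable key as the cluster number, call getVector() on the WeightedVectorWritable value, and apply the same instanceof check and cast to Mahout's own NamedVector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusterMembership {

    // Hypothetical stand-in for org.apache.mahout.math.Vector.
    interface Vector {}

    // Hypothetical stand-in for a plain (unnamed) vector.
    static final class DenseVector implements Vector {
        final double[] values;
        DenseVector(double... values) { this.values = values; }
    }

    // Hypothetical stand-in for Mahout's NamedVector: a delegate vector plus a name.
    static final class NamedVector implements Vector {
        private final Vector delegate;
        private final String name;
        NamedVector(Vector delegate, String name) {
            this.delegate = delegate;
            this.name = name;
        }
        Vector getDelegate() { return delegate; }
        String getName() { return name; }
    }

    // Stand-in for one (cluster number, weighted vector) record of the clustering output.
    static final class ClusteredPoint {
        final int clusterId;
        final Vector vector;
        ClusteredPoint(int clusterId, Vector vector) {
            this.clusterId = clusterId;
            this.vector = vector;
        }
    }

    // Groups document ids by cluster number, keeping the input order within each cluster.
    static Map<Integer, List<String>> groupByCluster(List<ClusteredPoint> points) {
        Map<Integer, List<String>> docsByCluster = new TreeMap<>();
        for (ClusteredPoint point : points) {
            // Only named vectors carry a document id; anything else gets a placeholder.
            String docId = (point.vector instanceof NamedVector)
                    ? ((NamedVector) point.vector).getName()
                    : "<unnamed>";
            docsByCluster.computeIfAbsent(point.clusterId, k -> new ArrayList<>()).add(docId);
        }
        return docsByCluster;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        List<ClusteredPoint> points = new ArrayList<>();
        points.add(new ClusteredPoint(0, new NamedVector(new DenseVector(0.1, 0.9), "/docs/a.txt")));
        points.add(new ClusteredPoint(1, new NamedVector(new DenseVector(0.8, 0.2), "/docs/b.txt")));
        points.add(new ClusteredPoint(0, new NamedVector(new DenseVector(0.2, 0.7), "/docs/c.txt")));

        for (Map.Entry<Integer, List<String>> entry : groupByCluster(points).entrySet()) {
            System.out.println("cluster " + entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

The instanceof check matters because any vector written without the namedVector flag set would not be a NamedVector, and a bare cast would then throw a ClassCastException.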


Many thanks though - I will certainly try out what you suggested too.

R


________________________________
From: Grant Ingersoll <gs...@apache.org>
To: user@mahout.apache.org; Rob Podolski <ro...@yahoo.co.uk>
Sent: Thursday, 10 November 2011, 7:20
Subject: Re: NewsKMeansClustering  - the result most people want seems to be missing


On Nov 9, 2011, at 3:17 AM, Rob Podolski wrote:

> Hi
> 
> Managed to get the Manning Chap 09 example NewsKMeansClustering  working with my own documents.  However, I thought the main point of this was to cluster the news articles together to get groups of similar content.  
> 
> 
> The example allows you to get the cluster membership in terms of WeightedVectorWritables.  But most of us want to know which actual news articles are in the cluster - not which numeric results are in a cluster (though this is useful for getting the most significant terms in the vector albeit indirectly).
> 
> 
> It seems to me that the only way of achieving this most useful result would be to use NamedVectors from the very outset and assign a document identifier as the name of each. Then presumably these would survive the pipeline through the various calls like
> 
> 
> DictionaryVectorizer.createTermFrequencyVectors;
> TFIDFConverter.processTfIdf;
> etc
> 
> However, I have not seen a way of doing this.  Anyone got any ideas?

You should be able to pass in --namedVector to the seq2sparse command, and those named vectors should be preserved throughout the process.  From build-asf-email.sh in trunk:
$MAHOUT seq2sparse --input $MAIL_OUT --output $SEQ2SP --norm 2 --weight TFIDF --namedVector --maxDFPercent 90 --minSupport 2 --analyzerName org.apache.mahout.text.MailArchivesClusteringAnalyzer



> 
> 
> The other thing I explored was whether there was a way of correlating the output WeightedVectorWritables with the original documents.  However, there is not even an equals() method on the WeightedVectorWritables to allow it (though that would be a bad solution anyhow).

See the ClusterDumper code.

> 
> I'm new to Mahout and have to admit I've been struggling even to get this far.  Any help would be gratefully received.
> 
> 
> R

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
