Posted to dev@mahout.apache.org by "Jake Mannix (Commented) (JIRA)" <ji...@apache.org> on 2011/12/01 04:27:41 UTC

[jira] [Commented] (MAHOUT-845) Make cluster top terms code more reusable

    [ https://issues.apache.org/jira/browse/MAHOUT-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160621#comment-13160621 ] 

Jake Mannix commented on MAHOUT-845:
------------------------------------

OK, I've thought about this a little. The implementation Frank posted here, which is essentially what I also had on my GitHub branch, is probably a bad idea, for exactly the reasons Lance raised here.

So instead, we modify VectorDumper and VectorHelper to add a couple of static methods and options:

in VectorHelper:
[code]
public static String vectorToJson(Vector vector, String[] dictionary, int maxEntries, boolean sort)
[code]

where the "sort" option orders the Vector entries by value, and maxEntries caps the number of vector entries emitted.  If dictionary is non-null, each vector index is replaced with its corresponding term from the dictionary.
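To make the intended behavior concrete, here is a minimal, self-contained sketch of the sort / truncate / dictionary-lookup logic. It is not the Mahout implementation: the class name TopTermsSketch and the method toJson are hypothetical, a plain Map<Integer, Double> stands in for a Mahout Vector, and sorting by absolute magnitude (descending) is an assumption taken from the option description in this comment.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the proposed VectorHelper.vectorToJson; not Mahout code.
final class TopTermsSketch {

  // A Map<Integer, Double> (index -> weight) stands in for a sparse Mahout Vector.
  static String toJson(Map<Integer, Double> vector, String[] dictionary,
                       int maxEntries, boolean sort) {
    List<Map.Entry<Integer, Double>> entries = new ArrayList<>(vector.entrySet());
    if (sort) {
      // Assumption: largest absolute weight first.
      entries.sort(Comparator.comparingDouble(
          (Map.Entry<Integer, Double> e) -> Math.abs(e.getValue())).reversed());
    }
    int limit = Math.max(0, Math.min(maxEntries, entries.size()));
    StringBuilder json = new StringBuilder("{");
    for (int i = 0; i < limit; i++) {
      Map.Entry<Integer, Double> e = entries.get(i);
      int index = e.getKey();
      // Use the dictionary term when one is available, else the raw index.
      String key = (dictionary != null && index >= 0 && index < dictionary.length)
          ? dictionary[index]
          : String.valueOf(index);
      if (i > 0) {
        json.append(',');
      }
      // Escape backslashes and quotes so the key stays a valid JSON string.
      json.append('"')
          .append(key.replace("\\", "\\\\").replace("\"", "\\\""))
          .append("\":")
          .append(e.getValue());
    }
    return json.append('}').toString();
  }

  public static void main(String[] args) {
    Map<Integer, Double> vector = new LinkedHashMap<>();
    vector.put(0, 3.0);
    vector.put(1, 0.5);
    vector.put(2, -2.5);
    String[] dictionary = {"apple", "banana", "cherry"};
    // Prints {"apple":3.0,"cherry":-2.5}
    System.out.println(toJson(vector, dictionary, 2, true));
  }
}
```

Keeping this as a static helper that takes the dictionary as an argument, rather than a method on the vector or cluster, is what lets the same code serve cluster centroids, topic distributions, or any other labeled vectors.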

VectorDumper, in turn, gains the following options:
[code]
Option sortVectorsOpt = obuilder.withLongName("sortVectors").withRequired(false).withDescription(
            "Sort output key/value pairs of the vector entries in abs magnitude descending order")
            .withShortName("sort").create();
Option numIndexesPerVectorOpt = obuilder.withLongName("vectorSize").withShortName("vs").withRequired(false)
         .withArgument(abuilder.withName("vs").withMinimum(1).withMaximum(1).create())
         .withDescription("Truncate vectors to <vs> length when dumping (most useful when used in"
                          + " conjunction with -sort)").create();
[code]

Then, if you have clusters represented as vector centroids (or distributions over terms/features, or any other collection of Vectors linked to a dictionary of String labels for the vector indexes), you don't really need a "ClusterDumper", because

[code]
$MAHOUT_HOME/bin/mahout vectordump -s "path/to/vectors/part-*" --dictionary "path/to/dictionary.file-0" -dt sequencefile -sort --vectorSize 100 -o local_vectors.json
[code]

writes each vector in "path/to/vectors/part-*" to local_vectors.json, one per line, in JSON format. The keys are the terms with the highest weights in the vector, the values are the corresponding vector values, and only the top 100 entries (by value) per vector are emitted.

I've found this modification to VectorDumper invaluable in inspecting LDA topic models, but doing it without modifying the Vector interface is even better.
                
> Make cluster top terms code more reusable
> -----------------------------------------
>
>                 Key: MAHOUT-845
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-845
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Frank Scholten
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: MAHOUT-845.patch, MAHOUT-845.patch, MAHOUT-845.patch
>
>
> When working with Mahout text clustering I find that I keep writing code similar to the contents of
> public static String getTopFeatures(Cluster cluster, String[] dictionary, int numTerms)
> in ClusterDumper in order to determine cluster labels.
> I think it would be useful if (parts of) this code are added to the cluster or vector API so that you could do something like
> Cluster cluster = ... // get the cluster from seq file iterable
> String clusterLabel = cluster.getTopTerms(1, dictionary); // Do something with the label  
> I think this would make it easier to export and post-process clustering results, like indexing or storing them elsewhere.
> Thoughts?
