Posted to dev@mahout.apache.org by "Suneel Marthi (JIRA)" <ji...@apache.org> on 2014/01/26 22:10:38 UTC

[jira] [Commented] (MAHOUT-996) Support NamedVectors in arff.vector job by convention

    [ https://issues.apache.org/jira/browse/MAHOUT-996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882426#comment-13882426 ] 

Suneel Marthi commented on MAHOUT-996:
--------------------------------------

This can be marked as 'Resolved' in light of the recent fix for M-1410, which added VectorIds that identify which vector is in which cluster.

> Support NamedVectors in arff.vector job by convention
> -----------------------------------------------------
>
>                 Key: MAHOUT-996
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-996
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.7
>         Environment: OS X
>            Reporter: Andrew Harbick
>            Assignee: Sebastian Schelter
>            Priority: Minor
>             Fix For: Backlog
>
>         Attachments: forillustration.patch
>
>
> If you do something like:
> MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout arff.vector --input $PWD/file.arff --dictOut file.bindings --output $PWD
> MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout kmeans --input $PWD/file.arff.mvc --clusters $PWD/output/file.clusters --output $PWD/output --numClusters 3 --maxIter 1000 --clustering
> MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir $PWD/output/clusters-*-final --pointsDir $PWD/output/clusteredPoints --output $PWD/output/clusteranalyze.txt
> Currently you don't get any information out of clusterdump that helps you identify which element from your source data is in which cluster.
> I did a patch to illustrate using an attribute (by convention) from the ARFF file as the name for a NamedVector.  The result of clusterdump is much easier to use:
> VL-18589{n=6165 c=[1.376, 879.144, 3.947, 10.691, 0.874, 1.266, 16.644, 9.689, 2.207, 1.855] r=[0.484, 160.571, 1.959, 6.176, 0.551, 0.442, 34.125, 7.953, 1.988, 0.352]}
>         Weight : [props - optional]:  Point:
>         1.0: 4ee342afd04516354c000140 = [1.000, 597.000, 7.000, 7.000, 1.000, 1.000, 11.000, 12.000, 6.000, 2.000]
>         1.0: 4ee49257eb8b3e28c60025a2 = [1.000, 597.000, 1.000, 7.000, 1.000, 1.000, 8.000, 17.000, 6.000, 2.000]
>         1.0: 4ee60430ab2c714006000937 = [1.000, 597.000, 2.000, 9.000, 1.000, 1.000, 21.000, 21.000, 2.000, 2.000]
>         1.0: 4ef2d580ab2c71231b0019ae = [0:1.000, 1:598.000, 2:5.000, 3:3.000, 5:1.000, 6:4.000, 9:1.000]
>         1.0: 4eda14a30b5d3e655b0043e9 = [1.000, 599.000, 7.000, 8.000, 2.000, 1.000, 15.000, 7.000, 3.000, 2.000]
>         1.0: 4edba62deb8b3e27e6000614 = [0:1.000, 1:599.000, 2:1.000, 3:12.000, 4:1.000, 5:1.000, 6:3.000, 8:3.000, 9:2.000]
>         1.0: 4ede1ea6eb8b3e1f330050f4 = [0:1.000, 1:599.000, 2:3.000, 3:9.000, 4:1.000, 5:1.000, 6:14.000, 7:20.000, 9:2.000]
> ...
> I haven't done serious Java in 15 years, so the attached patch is just to illustrate the idea...
> Thanks,
> Andy
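
The naming idea in the quoted description can be sketched in plain Java, without any Mahout dependency. This is only an illustration under stated assumptions, not the attached patch: the class ArffNameByConvention, the NamedRow type, and the nameIndex parameter are all hypothetical, and the description does not say which attribute the convention picks, so the caller chooses it here. The sketch handles only dense, comma-separated ARFF data rows.

```java
import java.util.Arrays;

// Hypothetical sketch: none of these names come from Mahout or the attached patch.
public class ArffNameByConvention {

    /** Stand-in for a named vector: a label plus the numeric attribute values. */
    public static final class NamedRow {
        public final String name;
        public final double[] values;

        NamedRow(String name, double[] values) {
            this.name = name;
            this.values = values;
        }
    }

    /**
     * Parses one dense ARFF data row. The attribute at nameIndex becomes the
     * row's name; every other attribute is parsed as a numeric value.
     * Which attribute serves as the name is an assumption made by the caller.
     */
    public static NamedRow parse(String dataLine, int nameIndex) {
        String[] fields = dataLine.split(",");
        if (nameIndex < 0 || nameIndex >= fields.length) {
            throw new IllegalArgumentException("nameIndex out of range: " + nameIndex);
        }
        double[] values = new double[fields.length - 1];
        String name = null;
        int next = 0;
        for (int i = 0; i < fields.length; i++) {
            String field = fields[i].trim();
            if (i == nameIndex) {
                name = field;                      // keep the identifier as text
            } else {
                values[next++] = Double.parseDouble(field);
            }
        }
        return new NamedRow(name, values);
    }

    public static void main(String[] args) {
        // The identifier and leading values are taken from the clusterdump output quoted above.
        NamedRow row = parse("4ee342afd04516354c000140,1,597,7", 0);
        // prints: 4ee342afd04516354c000140 = [1.0, 597.0, 7.0]
        System.out.println(row.name + " = " + Arrays.toString(row.values));
    }
}
```

In Mahout itself, the equivalent step would presumably wrap the parsed vector in a NamedVector carrying that name, so that clusterdump can print it next to each point.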


