Posted to dev@mahout.apache.org by "Andrew Harbick (Updated) (JIRA)" <ji...@apache.org> on 2012/03/30 00:23:25 UTC
[jira] [Updated] (MAHOUT-996) Support NamedVectors in arff.vector job by convention
[ https://issues.apache.org/jira/browse/MAHOUT-996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Harbick updated MAHOUT-996:
----------------------------------
Description:
If you do something like:
MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout arff.vector --input $PWD/file.arff --dictOut file.bindings --output $PWD
MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout kmeans --input $PWD/file.arff.mvc --clusters $PWD/output/file.clusters --output $PWD/output --numClusters 3 --maxIter 1000 --clustering
MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir $PWD/output/clusters-*-final --pointsDir $PWD/output/clusteredPoints --output $PWD/output/clusteranalyze.txt
Currently clusterdump gives you no information that identifies which element of your source data landed in which cluster.
I wrote a patch, for illustration, that uses an attribute (chosen by convention) from the ARFF file as the name for a NamedVector. The resulting clusterdump output is much easier to use:
VL-18589{n=6165 c=[1.376, 879.144, 3.947, 10.691, 0.874, 1.266, 16.644, 9.689, 2.207, 1.855] r=[0.484, 160.571, 1.959, 6.176, 0.551, 0.442, 34.125, 7.953, 1.988, 0.352]}
Weight : [props - optional]: Point:
1.0: 4ee342afd04516354c000140 = [1.000, 597.000, 7.000, 7.000, 1.000, 1.000, 11.000, 12.000, 6.000, 2.000]
1.0: 4ee49257eb8b3e28c60025a2 = [1.000, 597.000, 1.000, 7.000, 1.000, 1.000, 8.000, 17.000, 6.000, 2.000]
1.0: 4ee60430ab2c714006000937 = [1.000, 597.000, 2.000, 9.000, 1.000, 1.000, 21.000, 21.000, 2.000, 2.000]
1.0: 4ef2d580ab2c71231b0019ae = [0:1.000, 1:598.000, 2:5.000, 3:3.000, 5:1.000, 6:4.000, 9:1.000]
1.0: 4eda14a30b5d3e655b0043e9 = [1.000, 599.000, 7.000, 8.000, 2.000, 1.000, 15.000, 7.000, 3.000, 2.000]
1.0: 4edba62deb8b3e27e6000614 = [0:1.000, 1:599.000, 2:1.000, 3:12.000, 4:1.000, 5:1.000, 6:3.000, 8:3.000, 9:2.000]
1.0: 4ede1ea6eb8b3e1f330050f4 = [0:1.000, 1:599.000, 2:3.000, 3:9.000, 4:1.000, 5:1.000, 6:14.000, 7:20.000, 9:2.000]
...
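For context, a minimal ARFF input following this convention might look like the fragment below (the relation and feature attribute names are hypothetical; the only thing the patch keys on is an attribute literally named vector_name):

```
@RELATION example
@ATTRIBUTE vector_name STRING
@ATTRIBUTE feature1    NUMERIC
@ATTRIBUTE feature2    NUMERIC
@DATA
4ee342afd04516354c000140,1.000,597.000
4ee49257eb8b3e28c60025a2,1.000,597.000
```

The vector_name column is excluded from the numeric vector and used only as the vector's name.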
I haven't done serious Java in 15 years, so the attached patch is just to illustrate the idea...
Thanks,
Andy
was:
If you do something like:
MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout arff.vector --input $PWD/file.arff --dictOut file.bindings --output $PWD
MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout kmeans --input $PWD/file.arff.mvc --clusters $PWD/output/file.clusters --output $PWD/output --numClusters 3 --maxIter 1000 --clustering
MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir $PWD/output/clusters-*-final --pointsDir $PWD/output/clusteredPoints --output $PWD/output/clusteranalyze.txt
Currently clusterdump gives you no information that identifies which element of your source data landed in which cluster.
I wrote a patch, for illustration, that uses an attribute (chosen by convention) from the ARFF file as the name for a NamedVector. The resulting clusterdump output is much easier to use:
VL-18589{n=6165 c=[1.376, 879.144, 3.947, 10.691, 0.874, 1.266, 16.644, 9.689, 2.207, 1.855] r=[0.484, 160.571, 1.959, 6.176, 0.551, 0.442, 34.125, 7.953, 1.988, 0.352]}
Weight : [props - optional]: Point:
1.0: 4ee342afd04516354c000140 = [1.000, 597.000, 7.000, 7.000, 1.000, 1.000, 11.000, 12.000, 6.000, 2.000]
1.0: 4ee49257eb8b3e28c60025a2 = [1.000, 597.000, 1.000, 7.000, 1.000, 1.000, 8.000, 17.000, 6.000, 2.000]
1.0: 4ee60430ab2c714006000937 = [1.000, 597.000, 2.000, 9.000, 1.000, 1.000, 21.000, 21.000, 2.000, 2.000]
1.0: 4ef2d580ab2c71231b0019ae = [0:1.000, 1:598.000, 2:5.000, 3:3.000, 5:1.000, 6:4.000, 9:1.000]
1.0: 4eda14a30b5d3e655b0043e9 = [1.000, 599.000, 7.000, 8.000, 2.000, 1.000, 15.000, 7.000, 3.000, 2.000]
1.0: 4edba62deb8b3e27e6000614 = [0:1.000, 1:599.000, 2:1.000, 3:12.000, 4:1.000, 5:1.000, 6:3.000, 8:3.000, 9:2.000]
1.0: 4ede1ea6eb8b3e1f330050f4 = [0:1.000, 1:599.000, 2:3.000, 3:9.000, 4:1.000, 5:1.000, 6:14.000, 7:20.000, 9:2.000]
...
Here's what the patch looks like (again, it's just for illustration; I haven't done serious Java in 15 years).
Index: integration/src/main/java/org/apache/mahout/utils/vectors/arff/ARFFIterator.java
===================================================================
--- integration/src/main/java/org/apache/mahout/utils/vectors/arff/ARFFIterator.java (revision 1305503)
+++ integration/src/main/java/org/apache/mahout/utils/vectors/arff/ARFFIterator.java (working copy)
@@ -19,11 +19,14 @@
import java.io.BufferedReader;
import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
import java.util.regex.Pattern;
import com.google.common.collect.AbstractIterator;
import com.google.common.io.Closeables;
import org.apache.mahout.math.DenseVector;
+import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
@@ -71,11 +74,32 @@
result.setQuick(idx, model.getValue(data, idx));
}
} else {
- result = new DenseVector(model.getLabelSize());
- String[] splits = COMMA_PATTERN.split(line);
- for (int i = 0; i < splits.length; i++) {
- result.setQuick(i, model.getValue(splits[i], i));
+ ArrayList<String> splits = new ArrayList<String>(Arrays.asList(COMMA_PATTERN.split(line)));
+ DenseVector dv = null;
+
+ // If there is a vector_name attribute then we'll create a NamedVector as
+ // our result and exclude the value from our DenseVector.
+ Integer vectorNameIdx = model.getLabelIndex("vector_name");
+ if (vectorNameIdx != null) {
+ dv = new DenseVector(model.getLabelSize()-1);
}
+ else {
+ dv = new DenseVector(model.getLabelSize());
+ }
+
+ int j = 0;
+ for (int i = 0; i < splits.size(); i++) {
+ // Guard against an NPE when the ARFF file has no vector_name attribute.
+ if (vectorNameIdx == null || i != vectorNameIdx.intValue()) {
+ dv.setQuick(j++, model.getValue(splits.get(i), i));
+ }
+ }
+
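The hunk above is truncated before the result is actually wrapped in a NamedVector, so the remaining step is left as-is. As a self-contained sketch of the core idea in plain Java (no Mahout classes; NamedRowParser and its method names are my own invention for illustration, not Mahout API):

```java
import java.util.Arrays;

public class NamedRowParser {
    /**
     * Parse one comma-separated data row. The column at nameIdx becomes the
     * row's name (analogous to the vector_name attribute); the remaining
     * columns become the numeric vector, mirroring the loop in the patch.
     * Returns {name, double[] values}.
     */
    static Object[] parse(String line, int nameIdx) {
        String[] splits = line.split(",");
        double[] values = new double[splits.length - 1];
        String name = null;
        int j = 0;
        for (int i = 0; i < splits.length; i++) {
            if (i == nameIdx) {
                name = splits[i].trim();   // excluded from the numeric vector
            } else {
                values[j++] = Double.parseDouble(splits[i].trim());
            }
        }
        return new Object[] { name, values };
    }

    public static void main(String[] args) {
        Object[] row = parse("4ee342afd04516354c000140,1.000,597.000", 0);
        System.out.println(row[0] + " -> " + Arrays.toString((double[]) row[1]));
    }
}
```

In the actual patch the final step would presumably wrap dv in a NamedVector using the excluded column's value as the name, falling back to the plain DenseVector when no vector_name attribute exists.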
Thanks,
Andy
> Support NamedVectors in arff.vector job by convention
> -----------------------------------------------------
>
> Key: MAHOUT-996
> URL: https://issues.apache.org/jira/browse/MAHOUT-996
> Project: Mahout
> Issue Type: Improvement
> Components: Integration
> Affects Versions: 0.7
> Environment: OS X
> Reporter: Andrew Harbick
> Priority: Minor
> Fix For: 0.7
>
>
--
This message is automatically generated by JIRA.