Posted to dev@mahout.apache.org by "Andrew Harbick (Updated) (JIRA)" <ji...@apache.org> on 2012/03/30 00:23:25 UTC
[jira] [Updated] (MAHOUT-996) Support NamedVectors in arff.vector job by convention
[ https://issues.apache.org/jira/browse/MAHOUT-996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Harbick updated MAHOUT-996:
----------------------------------
Description:
If you do something like:
MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout arff.vector --input $PWD/file.arff --dictOut file.bindings --output $PWD
MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout kmeans --input $PWD/file.arff.mvc --clusters $PWD/output/file.clusters --output $PWD/output --numClusters 3 --maxIter 1000 --clustering
MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir $PWD/output/clusters-*-final --pointsDir $PWD/output/clusteredPoints --output $PWD/output/clusteranalyze.txt
Currently clusterdump gives you no information that identifies which element of your source data landed in which cluster.
I wrote a patch, for illustration, that uses an attribute (chosen by convention) from the ARFF file as the name for a NamedVector. The resulting clusterdump output is much easier to use:
VL-18589{n=6165 c=[1.376, 879.144, 3.947, 10.691, 0.874, 1.266, 16.644, 9.689, 2.207, 1.855] r=[0.484, 160.571, 1.959, 6.176, 0.551, 0.442, 34.125, 7.953, 1.988, 0.352]}
Weight : [props - optional]: Point:
1.0: 4ee342afd04516354c000140 = [1.000, 597.000, 7.000, 7.000, 1.000, 1.000, 11.000, 12.000, 6.000, 2.000]
1.0: 4ee49257eb8b3e28c60025a2 = [1.000, 597.000, 1.000, 7.000, 1.000, 1.000, 8.000, 17.000, 6.000, 2.000]
1.0: 4ee60430ab2c714006000937 = [1.000, 597.000, 2.000, 9.000, 1.000, 1.000, 21.000, 21.000, 2.000, 2.000]
1.0: 4ef2d580ab2c71231b0019ae = [0:1.000, 1:598.000, 2:5.000, 3:3.000, 5:1.000, 6:4.000, 9:1.000]
1.0: 4eda14a30b5d3e655b0043e9 = [1.000, 599.000, 7.000, 8.000, 2.000, 1.000, 15.000, 7.000, 3.000, 2.000]
1.0: 4edba62deb8b3e27e6000614 = [0:1.000, 1:599.000, 2:1.000, 3:12.000, 4:1.000, 5:1.000, 6:3.000, 8:3.000, 9:2.000]
1.0: 4ede1ea6eb8b3e1f330050f4 = [0:1.000, 1:599.000, 2:3.000, 3:9.000, 4:1.000, 5:1.000, 6:14.000, 7:20.000, 9:2.000]
...
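For context, a minimal ARFF input following this convention might look like the fragment below (the relation and feature attribute names are hypothetical; the only thing the patch keys on is an attribute literally named vector_name):

```
@RELATION example
@ATTRIBUTE vector_name STRING
@ATTRIBUTE feature1    NUMERIC
@ATTRIBUTE feature2    NUMERIC
@DATA
4ee342afd04516354c000140,1.000,597.000
4ee49257eb8b3e28c60025a2,1.000,597.000
```

The vector_name column is excluded from the numeric vector and used only as the vector's name.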
I haven't done serious Java in 15 years, so the attached patch is just to illustrate the idea...
Thanks,
Andy
was:
If you do something like:
MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout arff.vector --input $PWD/file.arff --dictOut file.bindings --output $PWD
MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout kmeans --input $PWD/file.arff.mvc --clusters $PWD/output/file.clusters --output $PWD/output --numClusters 3 --maxIter 1000 --clustering
MAHOUT_LOCAL=1 $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir $PWD/output/clusters-*-final --pointsDir $PWD/output/clusteredPoints --output $PWD/output/clusteranalyze.txt
Currently clusterdump gives you no information that identifies which element of your source data landed in which cluster.
I wrote a patch, for illustration, that uses an attribute (chosen by convention) from the ARFF file as the name for a NamedVector. The resulting clusterdump output is much easier to use:
VL-18589{n=6165 c=[1.376, 879.144, 3.947, 10.691, 0.874, 1.266, 16.644, 9.689, 2.207, 1.855] r=[0.484, 160.571, 1.959, 6.176, 0.551, 0.442, 34.125, 7.953, 1.988, 0.352]}
Weight : [props - optional]: Point:
1.0: 4ee342afd04516354c000140 = [1.000, 597.000, 7.000, 7.000, 1.000, 1.000, 11.000, 12.000, 6.000, 2.000]
1.0: 4ee49257eb8b3e28c60025a2 = [1.000, 597.000, 1.000, 7.000, 1.000, 1.000, 8.000, 17.000, 6.000, 2.000]
1.0: 4ee60430ab2c714006000937 = [1.000, 597.000, 2.000, 9.000, 1.000, 1.000, 21.000, 21.000, 2.000, 2.000]
1.0: 4ef2d580ab2c71231b0019ae = [0:1.000, 1:598.000, 2:5.000, 3:3.000, 5:1.000, 6:4.000, 9:1.000]
1.0: 4eda14a30b5d3e655b0043e9 = [1.000, 599.000, 7.000, 8.000, 2.000, 1.000, 15.000, 7.000, 3.000, 2.000]
1.0: 4edba62deb8b3e27e6000614 = [0:1.000, 1:599.000, 2:1.000, 3:12.000, 4:1.000, 5:1.000, 6:3.000, 8:3.000, 9:2.000]
1.0: 4ede1ea6eb8b3e1f330050f4 = [0:1.000, 1:599.000, 2:3.000, 3:9.000, 4:1.000, 5:1.000, 6:14.000, 7:20.000, 9:2.000]
...
Here's what the patch looks like (again, it's just for illustration; I haven't done serious Java in 15 years).
Index: integration/src/main/java/org/apache/mahout/utils/vectors/arff/ARFFIterator.java
===================================================================
--- integration/src/main/java/org/apache/mahout/utils/vectors/arff/ARFFIterator.java (revision 1305503)
+++ integration/src/main/java/org/apache/mahout/utils/vectors/arff/ARFFIterator.java (working copy)
@@ -19,11 +19,14 @@
import java.io.BufferedReader;
import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
import java.util.regex.Pattern;
import com.google.common.collect.AbstractIterator;
import com.google.common.io.Closeables;
import org.apache.mahout.math.DenseVector;
+import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
@@ -71,11 +74,32 @@
result.setQuick(idx, model.getValue(data, idx));
}
} else {
- result = new DenseVector(model.getLabelSize());
- String[] splits = COMMA_PATTERN.split(line);
- for (int i = 0; i < splits.length; i++) {
- result.setQuick(i, model.getValue(splits[i], i));
+ ArrayList<String> splits = new ArrayList<String>(Arrays.asList(COMMA_PATTERN.split(line)));
+ DenseVector dv = null;
+
+ // If there is a vector_name attribute then we'll create a NamedVector as
+ // our result and exclude the value from our DenseVector.
+ Integer vectorNameIdx = model.getLabelIndex("vector_name");
+ if (vectorNameIdx != null) {
+ dv = new DenseVector(model.getLabelSize()-1);
}
+ else {
+ dv = new DenseVector(model.getLabelSize());
+ }
+
+ int j = 0;
+ for (int i = 0; i < splits.size(); i++) {
+ // Guard against an NPE when the ARFF file has no vector_name attribute.
+ if (vectorNameIdx == null || i != vectorNameIdx.intValue()) {
+ dv.setQuick(j++, model.getValue(splits.get(i), i));
+ }
+ }
+
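The hunk above is truncated before the result is actually wrapped in a NamedVector, so the remaining step is left as-is. As a self-contained sketch of the core idea in plain Java (no Mahout classes; NamedRowParser and its method names are my own invention for illustration, not Mahout API):

```java
import java.util.Arrays;

public class NamedRowParser {
    /**
     * Parse one comma-separated data row. The column at nameIdx becomes the
     * row's name (analogous to the vector_name attribute); the remaining
     * columns become the numeric vector, mirroring the loop in the patch.
     * Returns {name, double[] values}.
     */
    static Object[] parse(String line, int nameIdx) {
        String[] splits = line.split(",");
        double[] values = new double[splits.length - 1];
        String name = null;
        int j = 0;
        for (int i = 0; i < splits.length; i++) {
            if (i == nameIdx) {
                name = splits[i].trim();   // excluded from the numeric vector
            } else {
                values[j++] = Double.parseDouble(splits[i].trim());
            }
        }
        return new Object[] { name, values };
    }

    public static void main(String[] args) {
        Object[] row = parse("4ee342afd04516354c000140,1.000,597.000", 0);
        System.out.println(row[0] + " -> " + Arrays.toString((double[]) row[1]));
    }
}
```

In the actual patch the final step would presumably wrap dv in a NamedVector using the excluded column's value as the name, falling back to the plain DenseVector when no vector_name attribute exists.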
Thanks,
Andy
> Support NamedVectors in arff.vector job by convention
> -----------------------------------------------------
>
> Key: MAHOUT-996
> URL: https://issues.apache.org/jira/browse/MAHOUT-996
> Project: Mahout
> Issue Type: Improvement
> Components: Integration
> Affects Versions: 0.7
> Environment: OS X
> Reporter: Andrew Harbick
> Priority: Minor
> Fix For: 0.7
>
>
--
This message is automatically generated by JIRA.