Posted to user@mahout.apache.org by Delroy Cameron <de...@gmail.com> on 2010/05/24 20:57:36 UTC

Re: where are the points in each cluster - kmeans clusterdump

Hey Jeff,

Where can I find some documentation on the use of the -cl parameter?
It seems to be missing from the wiki and does not appear in the help output for
seqdirectory, seq2sparse, or clusterdump.

-----
--cheers
Delroy

Re: where are the points in each cluster - kmeans clusterdump

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Delroy,

Thanks for the positive feedback. I'm glad to hear you are up and running.

Jeff

On 5/26/10 3:29 PM, Delroy Cameron wrote:
> Jeff, I must thank you.
>
> These exchanges have been very productive.
> The patch you suggested worked like butter...
> I now have the doc ids I need from the cluster dump.
>
> Thank you, kind Sir...
>
> -----
> --cheers
> Delroy
>    


Re: where are the points in each cluster - kmeans clusterdump

Posted by Delroy Cameron <de...@gmail.com>.
Jeff, I must thank you.

These exchanges have been very productive.
The patch you suggested worked like butter...
I now have the doc ids I need from the cluster dump.

Thank you, kind Sir...

-----
--cheers
Delroy

Re: where are the points in each cluster - kmeans clusterdump

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Indeed it has changed quite a bit recently. Vectors formerly had a name 
field, which allowed documentIds to be carried along in their term 
vectors; that field has been removed. The refactoring introduced 
NamedVector, which wraps a normal vector so it can carry along such a 
name. As NamedVectors are also Vectors, they flow through the various 
jobs transparently.

When you run seq2sparse on a set of text documents, it produces an 
output sequence file of <Text, VectorWritable> pairs with the documentId 
in the key field but not in the value field. It looks like it should 
instead wrap each value in a NamedVector carrying the documentId. The 
following patch seems to correct the problem, though it needs more 
testing:

Index: utils/src/main/java/org/apache/mahout/utils/vectors/text/term/TFPartialVectorReducer.java
===================================================================
--- utils/src/main/java/org/apache/mahout/utils/vectors/text/term/TFPartialVectorReducer.java    (revision 948493)
+++ utils/src/main/java/org/apache/mahout/utils/vectors/text/term/TFPartialVectorReducer.java    (working copy)
@@ -35,6 +35,7 @@
 import org.apache.lucene.analysis.shingle.ShingleFilter;
 import org.apache.lucene.analysis.tokenattributes.TermAttribute;
 import org.apache.mahout.common.StringTuple;
+import org.apache.mahout.math.NamedVector;
 import org.apache.mahout.math.RandomAccessSparseVector;
 import org.apache.mahout.math.SequentialAccessSparseVector;
 import org.apache.mahout.math.Vector;
@@ -102,7 +103,7 @@
     }
     // if the vector has no nonZero entries (nothing in the dictionary), let's not waste space sending it to disk.
     if(vector.getNumNondefaultElements() > 0) {
-      VectorWritable vectorWritable = new VectorWritable(vector);
+      VectorWritable vectorWritable = new VectorWritable(new NamedVector(vector, key.toString()));
       output.collect(key, vectorWritable);
     } else {
       reporter.incrCounter("TFParticalVectorReducer", "emptyVectorCount", 1);
Index: utils/src/main/java/org/apache/mahout/utils/vectors/tfidf/TFIDFPartialVectorReducer.java
===================================================================
--- utils/src/main/java/org/apache/mahout/utils/vectors/tfidf/TFIDFPartialVectorReducer.java    (revision 948493)
+++ utils/src/main/java/org/apache/mahout/utils/vectors/tfidf/TFIDFPartialVectorReducer.java    (working copy)
@@ -33,11 +33,12 @@
 import org.apache.hadoop.mapred.OutputCollector;
 import org.apache.hadoop.mapred.Reducer;
 import org.apache.hadoop.mapred.Reporter;
+import org.apache.mahout.math.NamedVector;
 import org.apache.mahout.math.RandomAccessSparseVector;
 import org.apache.mahout.math.SequentialAccessSparseVector;
 import org.apache.mahout.math.Vector;
+import org.apache.mahout.math.VectorWritable;
 import org.apache.mahout.math.Vector.Element;
-import org.apache.mahout.math.VectorWritable;
 import org.apache.mahout.math.map.OpenIntLongHashMap;
 import org.apache.mahout.utils.vectors.TFIDF;
 import org.apache.mahout.utils.vectors.common.PartialVectorMerger;
@@ -85,7 +86,7 @@
     if (sequentialAccess) {
       vector = new SequentialAccessSparseVector(vector);
     }
-    VectorWritable vectorWritable = new VectorWritable(vector);
+    VectorWritable vectorWritable = new VectorWritable(new NamedVector(vector, key.toString()));
     output.collect(key, vectorWritable);
   }
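
For anyone who wants to sanity-check the effect of this patch, here is a 
rough, untested sketch that reads one of the seq2sparse output files and 
prints the name carried by each vector. The path argument is just a 
placeholder (something like <seq2sparse-output>/tfidf-vectors/part-00000); 
with the patch applied, each value should be a NamedVector whose name 
matches the documentId in the key.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class DumpVectorNames {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]); // one of the <Text, VectorWritable> vector files

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Text key = new Text();
    VectorWritable value = new VectorWritable();
    try {
      while (reader.next(key, value)) {
        Vector vector = value.get();
        // with the patch, the value wraps the documentId in a NamedVector
        if (vector instanceof NamedVector) {
          System.out.println(key + " -> " + ((NamedVector) vector).getName());
        } else {
          System.out.println(key + " -> (unnamed vector)");
        }
      }
    } finally {
      reader.close();
    }
  }
}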



On 5/26/10 11:20 AM, Delroy Cameron wrote:
> Yeah Jeff,
> the implementation for printing the points has changed. Instead of a list of
> strings for each point, we now have a list of WeightedVectorWritable
> objects. The problem is that in the previous implementation getting the
> point id (i.e. the document id for each document in the cluster) was
> straightforward; see below.
>
> After looking at the API for the code and testing a few output variations on
> the points output, I am forced to ask: are the ids for the points in the
> WeightedVectorWritable object?
>
>   List<String> points = clusterIdToPoints.get(String.valueOf(cluster.getId()));
>          if (points != null) {
>            writer.write("\tPoints: ");
>            for (Iterator<String> iterator = points.iterator(); iterator.hasNext();) {
>              String point = iterator.next();
>              writer.append(point);
>              if (iterator.hasNext()) {
>                writer.append(", ");
>              }
>            }
>            writer.write('\n');
>          }
>
> Top Terms:
> 		were                                    =>    32.23076923076923
> 		expression                              =>   27.333333333333332
> 		gene                                    =>   23.076923076923077
> 		from                                    =>   19.641025641025642
> 		cells                                   =>    17.76923076923077
> 		c                                       =>    16.23076923076923
> 		1                                       =>    14.76923076923077
> 		human                                   =>   14.487179487179487
> 		5                                       =>   13.820512820512821
> 		we                                      =>   13.179487179487179
> 	Points: 10075717, 10330009, 10419905, 10811945, 11116137, 11222753, 11691919
>
> List<WeightedVectorWritable> points = clusterIdToPoints.get(cluster.getId());
>          if (points != null) {
>            writer.write("\tWeight:  Point:\n\t");
>            for (Iterator<WeightedVectorWritable> iterator = points.iterator(); iterator.hasNext();) {
>              WeightedVectorWritable point = iterator.next();
>              writer.append(Double.toString(point.getWeight())).append(": ");
>              writer.append(ClusterBase.formatVector(point.getVector().get(), dictionary));
>              if (iterator.hasNext()) {
>                writer.append("\n\t");
>              }
>            }
>            writer.write('\n');
>          }
>
> Top Terms:
>                  riele                                   =>   14.00426959991455
>                  meredith                                =>  12.727957301669651
>                  lysine-6                                =>  11.388569796526873
>                  amores                                  =>  10.307115837379738
>                  mashimo                                 =>    9.840165774027506
>                  halks                                   =>    9.598452267823395
>                  maseki                                  =>    8.773765140109592
>                  lysine-63                               =>    8.496143341064453
>                  saporita                                =>    8.167389004318803
>                  a94                                     =>    8.119972387949625
>          Weight:  Point:
>          1.0: [265:1.016, 1753:3.503, 2087:2.217, 2162:2.396, 2217:1.347, 2702:1.054, 2886:1.125, 2974:2.472, 3197:1.603, 3472:1.902, 3714:1.658, 3789:1.735, 4003:1.538, 4168:3.849, 4387:6.602, 4399:3.800, 4513:1.717, 4640:1.387, ...]
>
>
> -----
> --cheers
> Delroy
>    


Re: where are the points in each cluster - kmeans clusterdump

Posted by Delroy Cameron <de...@gmail.com>.
Yeah Jeff,
the implementation for printing the points has changed. Instead of a list of
strings for each point, we now have a list of WeightedVectorWritable
objects. The problem is that in the previous implementation getting the
point id (i.e. the document id for each document in the cluster) was
straightforward; see below.

After looking at the API for the code and testing a few output variations on
the points output, I am forced to ask: are the ids for the points in the
WeightedVectorWritable object? (A sketch of how the loop could print the
ids follows the second output sample below.)

 List<String> points = clusterIdToPoints.get(String.valueOf(cluster.getId()));
        if (points != null) {
          writer.write("\tPoints: ");
          for (Iterator<String> iterator = points.iterator(); iterator.hasNext();) {
            String point = iterator.next();
            writer.append(point);
            if (iterator.hasNext()) {
              writer.append(", ");
            }
          }
          writer.write('\n');
        }

Top Terms: 
		were                                    =>   32.23076923076923
		expression                              =>  27.333333333333332
		gene                                    =>  23.076923076923077
		from                                    =>  19.641025641025642
		cells                                   =>   17.76923076923077
		c                                       =>   16.23076923076923
		1                                       =>   14.76923076923077
		human                                   =>  14.487179487179487
		5                                       =>  13.820512820512821
		we                                      =>  13.179487179487179
	Points: 10075717, 10330009, 10419905, 10811945, 11116137, 11222753, 11691919

List<WeightedVectorWritable> points = clusterIdToPoints.get(cluster.getId());
        if (points != null) {
          writer.write("\tWeight:  Point:\n\t");
          for (Iterator<WeightedVectorWritable> iterator = points.iterator(); iterator.hasNext();) {
            WeightedVectorWritable point = iterator.next();
            writer.append(Double.toString(point.getWeight())).append(": ");
            writer.append(ClusterBase.formatVector(point.getVector().get(), dictionary));
            if (iterator.hasNext()) {
              writer.append("\n\t");
            }
          }
          writer.write('\n');
        }

Top Terms:
                riele                                   =>   14.00426959991455
                meredith                                =>  12.727957301669651
                lysine-6                                =>  11.388569796526873
                amores                                  =>  10.307115837379738
                mashimo                                 =>    9.840165774027506
                halks                                   =>    9.598452267823395
                maseki                                  =>    8.773765140109592
                lysine-63                               =>    8.496143341064453
                saporita                                =>    8.167389004318803
                a94                                     =>    8.119972387949625
        Weight:  Point:
        1.0: [265:1.016, 1753:3.503, 2087:2.217, 2162:2.396, 2217:1.347, 2702:1.054, 2886:1.125, 2974:2.472, 3197:1.603, 3472:1.902, 3714:1.658, 3789:1.735, 4003:1.538, 4168:3.849, 4387:6.602, 4399:3.800, 4513:1.717, 4640:1.387, ...]
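
For reference, here is a rough, untested sketch of how the points loop
could print the document ids directly, assuming the point vectors are
NamedVectors (which is what the TFPartialVectorReducer/TFIDFPartialVectorReducer
patch elsewhere in this thread arranges). It reuses writer, clusterIdToPoints,
cluster and dictionary from the surrounding ClusterDumper code and needs
org.apache.mahout.math.NamedVector imported:

List<WeightedVectorWritable> points = clusterIdToPoints.get(cluster.getId());
        if (points != null) {
          writer.write("\tPoints: ");
          for (Iterator<WeightedVectorWritable> iterator = points.iterator(); iterator.hasNext();) {
            Vector pointVector = iterator.next().getVector().get();
            // a NamedVector carries the documentId assigned by seq2sparse
            if (pointVector instanceof NamedVector) {
              writer.append(((NamedVector) pointVector).getName());
            } else {
              // fall back to the full term vector if no name is available
              writer.append(ClusterBase.formatVector(pointVector, dictionary));
            }
            if (iterator.hasNext()) {
              writer.append(", ");
            }
          }
          writer.write('\n');
        }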


-----
--cheers
Delroy

Re: where are the points in each cluster - kmeans clusterdump

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
It's a little imprecise and does not show the actual option names, but 
https://cwiki.apache.org/MAHOUT/k-means.html talks about runClustering 
along with other options in "Running k-Means Clustering". The same sort of 
language is on the other clustering pages too. These paragraphs describe 
the arguments to the Java jobs rather than the command lines themselves. 
I agree that a reference card for all the command line options for all 
routines would be a useful addition. We already have MAHOUT-294 open, 
under which this action could be tracked.

Would it be friendlier to make its default true?
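
For what it's worth, here is a rough, untested sketch of inspecting the
point assignments that get written when clustering is run. It assumes, as
the clusterdump snippets elsewhere in this thread suggest, that the points
are stored in a sequence file mapping cluster ids to WeightedVectorWritable
values; the path argument is just a placeholder for one of those files under
the k-means output directory, and class/package names may differ between
trunk snapshots.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

public class DumpClusteredPoints {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]); // placeholder: a clustered-points sequence file

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable clusterId = new IntWritable();
    WeightedVectorWritable point = new WeightedVectorWritable();
    try {
      while (reader.next(clusterId, point)) {
        // each record maps a cluster id to one weighted input point
        System.out.println("cluster " + clusterId + ": weight " + point.getWeight());
      }
    } finally {
      reader.close();
    }
  }
}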

Jeff

On 5/24/10 11:57 AM, Delroy Cameron wrote:
> Hey Jeff,
>
> Where can I find some documentation on the use of the -cl parameter?
> It seems to be missing from the wiki and does not appear in the help output for
> seqdirectory, seq2sparse, or clusterdump.
>
> -----
> --cheers
> Delroy
>