Posted to user@mahout.apache.org by Bob Morris <mo...@gmail.com> on 2014/03/24 00:25:10 UTC

newbie asks how to make dictionary files

I'm a Mahout novice trying to do some semantic data clustering with
Canopy clustering on some low-dimensional SequenceFiles that I
vectorized with ad-hoc Java code. (Some features are strings
vectorized by their Levenshtein distance from a constant, some are
DateTime objects vectorized as milliseconds since the Unix epoch,
some are georeferences, etc.) The results look promising, but I want
to get more detail out of the clusters than I know how to get from
ClusterDumper alone. In particular, it seems that CSVClusterWriter
should get me what I need (for each cluster, the center and the list
of vectors ordered by distance).

When I vectorized, I never explicitly built a Dictionary, which is---I
suppose---why I get a runtime ClassCastException when I invoke
ClusterDumper.readPoints(...), even though I tell the ClusterDumper
run method that the dictionary type is "sequencefile" while having no
sequence file to offer.
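For context on that exception: a SequenceFile reader casts each key to an expected class at runtime, so if the directory being read holds Text keys where IntWritable cluster ids are expected, the cast fails regardless of any dictionary. (That the missing dictionary is the cause is only a guess.) A minimal plain-Java sketch of the same failure mode, with `Object`/`Integer` standing in for Hadoop's `Text`/`IntWritable`:

```java
public class CastDemo {
    public static void main(String[] args) {
        // A reader hands back the key as a supertype; the cast to the
        // expected key class happens at runtime, so a file whose keys
        // are really Text blows up when IntWritable is expected.
        Object key = "cluster name";      // stands in for a Text key
        try {
            Integer id = (Integer) key;   // stands in for (IntWritable) key
            System.out.println(id);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: wrong key type");
        }
    }
}
```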

So I have these questions:
1. Am I right that the Exception in the dumper is caused by not
having a Dictionary file?
2. Where can I find documentation for the correct form of a
sequencefile Dictionary, and are there any convenience methods for
building it? (I start with a CSV file for the data, together with a
Map that associates column header names with a private type name that
specifies the algorithm to be applied in the vectorization.) I can
send the vectorization code if helpful.
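On question 2: Mahout's own text pipeline (seq2sparse) writes its dictionary as a Hadoop SequenceFile of Text keys (terms) to IntWritable values (ids). A minimal sketch of building such a term-to-id mapping in plain Java; the commented-out SequenceFile.Writer calls show roughly how it would be serialized (the path and helper names are illustrative, not from the original post):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DictionarySketch {
    // Build a term -> id dictionary from feature/term names.
    // With Hadoop on the classpath, serializing it in the format
    // seq2sparse uses (SequenceFile<Text, IntWritable>) is roughly:
    //   SequenceFile.Writer w = SequenceFile.createWriter(
    //       fs, conf, new Path("dictionary.file-0"),
    //       Text.class, IntWritable.class);
    //   for (Map.Entry<String, Integer> e : dict.entrySet())
    //       w.append(new Text(e.getKey()), new IntWritable(e.getValue()));
    public static Map<String, Integer> buildDictionary(String[] terms) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String t : terms) {
            // assign ids 0..n-1 in first-seen order; duplicates keep their id
            dict.putIfAbsent(t, dict.size());
        }
        return dict;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict =
            buildDictionary(new String[] {"ampule", "lichen", "ampule", "fungi"});
        System.out.println(dict); // {ampule=0, lichen=1, fungi=2}
    }
}
```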

Thanks in advance;
--Bob
Here's the dumper code with the point of the ClassCastException indicated:

public void test() throws Exception {
   String datasetDir = "Lichen/"; // bbg, Rubiaeceae/ or fungi/ for now
   String inputFile = "/tmp/vectors"; // inputDir + "vectors"
   String canopyOutput = "/tmp/clusters";
   String dumperInput = canopyOutput + "/clusters-0-final";
   String dumperOutput = "/tmp/clusters.txt";
   String clusterInput = dumperInput + "/part-r-00000";
   String clusterOutput = "/tmp/clusterDetail.txt";
   boolean runSequential = true;
   try {
      String[] args = {"-i", inputFile, "-o", canopyOutput,
                       "-t1", ".00000002", "-t2", ".00000001", "-ow"};
      CanopyDriver driver = new CanopyDriver();
      driver.run(args);

      // must need Path to the sequence file here also?
      String[] dumpArgs = {"-i", dumperInput, "-o", dumperOutput,
                           "-dt", "sequencefile"};
      ClusterDumper dumper = new ClusterDumper();
      dumper.run(dumpArgs);

      PrintWriter writer = new PrintWriter(new File(clusterOutput));
      Path pointsPathDir = new Path(dumperInput);
      Configuration conf = new Configuration();

      // The line below throws a runtime
      //   java.lang.ClassCastException: org.apache.hadoop.io.Text
      //   cannot be cast to org.apache.hadoop.io.IntWritable
      // Presumably need a Dictionary to pass via -d to ClusterDumper
      Map<Integer, List<WeightedPropertyVectorWritable>> clusterIdToPoints =
         ClusterDumper.readPoints(pointsPathDir, 10000, conf);

      // TODO: iterate over the Map and output with CSVClusterWriter;
      // `measure` is the DistanceMeasure used for clustering (not shown here)
      CSVClusterWriter csvClusterWriter =
         new CSVClusterWriter(writer, clusterIdToPoints, measure);
   } catch (Exception e) {
      System.out.println("test caught Exception");
      e.printStackTrace(System.out);
   }
}



-- 
Robert A. Morris

Emeritus Professor  of Computer Science
UMASS-Boston
100 Morrissey Blvd
Boston, MA 02125-3390


Filtered Push Project
Harvard University Herbaria
Harvard University

email: morris.bob@gmail.com
web: http://efg.cs.umb.edu/
web: http://wiki.filteredpush.org
http://www.cs.umb.edu/~ram
===
The content of this communication is made entirely on my
own behalf and in no way should be deemed to express
official positions of The University of Massachusetts at Boston or
Harvard University.

Re: newbie asks how to make dictionary files

Posted by Pat Ferrel <pa...@occamsmachete.com>.
This is from way back in old brain cells that may be suspect.

The dictionary is created in the text pipeline to map tokens to Mahout ids. It lets clusterdump tell you what the frequent terms in a cluster are, instead of the numeric ids Mahout uses internally.

You must have some mapping yourself that you originally used to vectorize your data? Something like “ampule” => 23 or the like for the other data types?

I wouldn’t try to make the dictionary work. Just reverse the mapping from the internal Mahout Ids to your external Ids.
23 => “ampule”
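That reversal is a few lines of plain Java (names here are illustrative; `externalToMahout` stands for whatever map the vectorization code used):

```java
import java.util.HashMap;
import java.util.Map;

public class IdReverser {
    // Invert an external-term -> Mahout-id map so cluster output that
    // refers to internal ids can be labeled with the original terms.
    public static Map<Integer, String> reverse(Map<String, Integer> externalToMahout) {
        Map<Integer, String> mahoutToExternal = new HashMap<>();
        for (Map.Entry<String, Integer> e : externalToMahout.entrySet()) {
            mahoutToExternal.put(e.getValue(), e.getKey());
        }
        return mahoutToExternal;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new HashMap<>();
        dict.put("ampule", 23);
        System.out.println(reverse(dict).get(23)); // prints "ampule"
    }
}
```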

Don’t give clusterdump a dictionary at all—it's optional. I use it without one on data that has an external dictionary all the time.

On Mar 23, 2014, at 4:25 PM, Bob Morris <mo...@gmail.com> wrote:
