Posted to user@mahout.apache.org by praneet mhatre <pr...@gmail.com> on 2011/11/09 01:12:10 UTC

Dirichlet Clustering Output

Hello All,

I am trying to use Clustering algorithms to recover Software Architecture
by using static features of code (e.g. method invocations, field accesses,
etc).
To start with, I ran TestClusterDumper (using the testDirichlet2()
function) on the sample data provided. However, I am not able to
interpret or visualize the results correctly, as I don't see any
assignment of input vectors to clusters, just a model of attributes.
Is there an additional step to be performed to generate the final
assignment?

Here's my input and output for Number of Clusters=10 and Number of
Iterations=10.

*Input:*
private static final String[] DOCS = {
      "The quick red fox jumped over the lazy brown dogs.",
      "The quick brown fox jumped over the lazy red dogs.",
      "The quick red cat jumped over the lazy brown dogs.",
      "The quick brown cat jumped over the lazy red dogs.",
      "Mary had a little lamb whose fleece was white as snow.",
      "Mary had a little goat whose fleece was white as snow.",
      "Mary had a little lamb whose fleece was black as tar.",
      "Dick had a little goat whose fleece was white as snow.",
      "Moby Dick is a story of a whale and a man obsessed.",
      "Moby Bob is a story of a walrus and a man obsessed.",
      "Moby Dick is a story of a whale and a crazy man.",
      "The robber wore a black fleece jacket and a baseball cap.",
      "The robber wore a red fleece jacket and a baseball cap.",
      "The robber wore a white fleece jacket and a baseball cap.",
      "The English Springer Spaniel is the best of all dogs.",
      "Hitesh Crista Crista Joel Thomas Arthur Praneet Hitesh Crista.",
      "Hitesh Crista Thomas Yasser Arthur Arthur Praneet Hitesh Crista.",
      "Hitesh Crista Thomas Sara Maryam Arthur Praneet Hitesh Crista."};

*Output:*

Complete output:
https://docs.google.com/document/d/1ApOj-XwNMei1JYwcAoj7Vgzg_6sLtJqT55SBhZxL_V0/edit?hl=en_US

First two clusters:

DC-0 total= 110 model= GC:0{n=11 c=[arthur:0.777, baseball:0.683,
black:0.508, brown:0.415, cap:0.683, cat:0.508, crista:1.038, dogs:0.382,
fleece:0.988, goat:0.254, had:0.622, hitesh:0.966, jacket:0.683,
joel:0.291, jumped:0.415, lamb:0.508, lazy:0.415, little:0.622, mary:0.683,
maryam:0.291, over:0.415, praneet:0.683, quick:0.415, red:0.572,
robber:0.683, sara:0.291, snow:0.455, tar:0.291, thomas:0.683, white:0.622,
whose:0.622, wore:0.683, yasser:0.291] r=[arthur:1.295, baseball:1.115,
black:1.077, brown:0.880, cap:1.115, cat:1.077, crista:1.707, dogs:0.809,
fleece:0.902, goat:0.803, had:1.016, hitesh:1.577, jacket:1.115,
joel:0.919, jumped:0.880, lamb:1.077, lazy:0.880, little:1.016, mary:1.115,
maryam:0.919, over:0.880, praneet:1.115, quick:0.880, red:0.935,
robber:1.115, sara:0.919, snow:0.966, tar:0.919, thomas:1.115, white:1.016,
whose:1.016, wore:1.115, yasser:0.919]}

    Top Terms:
        crista                                  =>  1.0381627082824707
        fleece                                  =>   0.987780137495561
        hitesh                                  =>  0.9658091718500311
        arthur                                  =>  0.7772231968966398
        wore                                    =>  0.6829302094199441
        thomas                                  =>  0.6829302094199441
        robber                                  =>  0.6829302094199441
        praneet                                 =>  0.6829302094199441
        mary                                    =>  0.6829302094199441
        jacket                                  =>  0.6829302094199441

DC-1 total= 0 model= GC:1{n=0 c=[all:1.064, arthur:1.312, baseball:-0.362,
best:-1.437, black:1.155, bob:0.798, brown:0.708, cap:0.154, cat:-1.008,
crazy:0.891, crista:-0.032, dick:1.358, dogs:0.254, english:-0.159,
fleece:0.047, fox:-0.397, goat:0.353, had:-0.217, hitesh:-0.722,
jacket:-0.794, joel:0.906, jumped:0.511, lamb:-0.742, lazy:-1.627,
little:0.259, man:1.254, mary:1.073, maryam:-0.979, moby:1.377,
obsessed:1.655, over:-2.704, praneet:2.064, quick:-1.444, red:0.212,
robber:-0.880, sara:-0.788, snow:-2.024, spaniel:-2.043, springer:-0.129,
story:-0.556, tar:0.036, thomas:-0.539, walrus:-0.663, whale:-0.449,
white:-0.872, whose:-1.372, wore:1.300, yasser:-1.198] r=[all:1.739,
arthur:0.544, baseball:0.344, best:0.583, black:2.614, bob:1.700,
brown:0.289, cap:-0.749, cat:2.273, crazy:2.075, crista:0.912, dick:-2.777,
dogs:1.587, english:1.792, fleece:1.370, fox:-1.535, goat:-0.910,
had:3.608, hitesh:1.639, jacket:1.127, joel:0.604, jumped:1.631,
lamb:0.786, lazy:2.790, little:2.492, man:0.151, mary:1.611, maryam:-0.466,
moby:1.370, obsessed:1.017, over:0.066, praneet:0.194, quick:1.352,
red:0.450, robber:1.414, sara:1.427, snow:1.350, spaniel:-0.446,
springer:1.615, story:1.330, tar:0.477, thomas:0.619, walrus:1.990,
whale:1.013, white:1.335, whose:0.218, wore:0.231, yasser:1.284]}

    Top Terms:
        praneet                                 =>   2.064261989527272
        obsessed                                =>  1.6554940510057867
        moby                                    =>  1.3767884191330173
        dick                                    =>  1.3584694137334954
        arthur                                  =>    1.31230884601195
        wore                                    =>  1.3000443458409314
        man                                     =>  1.2543030335395073
        black                                   =>   1.155114056222531
        mary                                    =>  1.0725645217854314
        all                                     =>  1.0641885117052403
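
A note on reading the dump above: judging from the numbers alone, each "Top Terms" list appears to be the c=[...] entries of that model sorted by weight, highest first (c is presumably the model's center and n the number of points it observed; that reading is an assumption, not something the dump states). A minimal sketch of that sorting, using a handful of values copied from DC-0; the class and method names are hypothetical and not part of Mahout:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: not Mahout's ClusterDumper code. */
final class TopTermsSketch {

  /** Returns the n terms with the largest weights, highest first. */
  static List<String> topTerms(Map<String, Double> center, int n) {
    List<Map.Entry<String, Double>> entries = new ArrayList<>(center.entrySet());
    entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
    List<String> top = new ArrayList<>();
    for (int i = 0; i < Math.min(n, entries.size()); i++) {
      top.add(entries.get(i).getKey());
    }
    return top;
  }

  /** A few c=[...] entries copied from the DC-0 model in the dump. */
  static Map<String, Double> sampleCenter() {
    Map<String, Double> c = new LinkedHashMap<>();
    c.put("arthur", 0.777);
    c.put("crista", 1.038);
    c.put("fleece", 0.988);
    c.put("goat", 0.254);
    c.put("hitesh", 0.966);
    return c;
  }

  public static void main(String[] args) {
    // Prints [crista, fleece, hitesh], the same order as the DC-0 Top Terms.
    System.out.println(topTerms(sampleCenter(), 3));
  }
}
```

This only explains the term ranking; it says nothing about which input documents belong to which cluster, which is the question asked above.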

2) Also, on a related note, I get a NullPointerException for a relatively
higher number of iterations. For instance, with the same set of data
points, I encounter the exception when I use 10 clusters and 15 iterations.
Any thoughts on that?

3) There is an example that clearly visualizes the clusters for
numerical 2-D data points. Is there a similar way to visualize text data
clusters? If it matters, scalability is currently not a big concern, as I
am only dealing with a few hundred input vectors and attributes at this
point.

Thank you,


-- 
Praneet Mhatre
Graduate Student
Donald Bren School of ICS
University of California, Irvine

RE: Dirichlet Clustering Output

Posted by Jeff Eastman <je...@Narus.com>.

Sorry for the delay in responding. By now you may have already figured this out. If not:

1. Did you specify the -cl option on Dirichlet to emit the clusteredPoints directory? The default is not to do so.
2. Did you specify the -p option on ClusterDumper to use that directory?
3. Which model are you using on Dirichlet? The default GaussianCluster doesn't do well with wide (e.g. text-clustering) vectors due to numerical instabilities. See examples/bin/build-reuters.sh for the incantation to use DistanceMeasureCluster+CosineDistanceMeasure instead.
4. I have never seen the NPE you describe. Can you include your command line and a stack dump?
5. With your small data set size, you should be using -xm sequential and not the default mapreduce execution mode. That also makes the NPE easier to debug if it recurs.
6. Check out DisplayDirichlet in examples, which can visualize 2-D points and their clusters.
7. I'd be interested to see if your experiment produces any results you can share. This sounds like a very unusual clustering application.

Jeff
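
To make point 3 of the reply a little more concrete, here is a small, self-contained sketch of what a cosine distance between two text vectors computes. It is not Mahout's CosineDistanceMeasure: the class and method names are hypothetical, the tokenizer is a naive split on non-letters, and no stop words are removed (the dump in the question suggests the test's analyzer does remove them). It only illustrates the idea of comparing the direction of two term-count vectors rather than their raw coordinates.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: a plain-Java cosine distance over term counts. */
final class CosineSketch {

  /** Lower-cases a sentence, splits on non-letters, and counts each term. */
  static Map<String, Integer> termCounts(String doc) {
    Map<String, Integer> counts = new HashMap<>();
    for (String term : doc.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
      if (!term.isEmpty()) {
        counts.merge(term, 1, Integer::sum);
      }
    }
    return counts;
  }

  /** Euclidean length of a sparse term-count vector. */
  static double norm(Map<String, Integer> v) {
    double sum = 0.0;
    for (int count : v.values()) {
      sum += (double) count * count;
    }
    return Math.sqrt(sum);
  }

  /** Cosine distance: 1 - (a . b) / (|a| |b|), clamped at 0 against rounding. */
  static double cosineDistance(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0.0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
    }
    double normA = norm(a);
    double normB = norm(b);
    if (normA == 0.0 || normB == 0.0) {
      return 1.0; // convention chosen for this sketch: empty vectors are maximally distant
    }
    return Math.max(0.0, 1.0 - dot / (normA * normB));
  }

  public static void main(String[] args) {
    Map<String, Integer> fox1 = termCounts("The quick red fox jumped over the lazy brown dogs.");
    Map<String, Integer> fox2 = termCounts("The quick brown fox jumped over the lazy red dogs.");
    Map<String, Integer> moby = termCounts("Moby Dick is a story of a whale and a man obsessed.");
    System.out.println("fox1 vs fox2: " + cosineDistance(fox1, fox2));
    System.out.println("fox1 vs moby: " + cosineDistance(fox1, moby));
  }
}
```

The first two sample sentences contain exactly the same words, so their distance is (up to rounding) 0, while the fox and Moby sentences share no terms and come out at distance 1. With a measure like this, assigning a document to a cluster amounts to picking the model it is closest to, which is the kind of per-point information the clusteredPoints output mentioned in point 1 is meant to hold.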
