Posted to user@mahout.apache.org by "Whitmore, Mattie" <mw...@harris.com> on 2012/08/15 19:45:39 UTC

Mahout-279/kmeans++

Hi!

I have been using RandomSeedGenerator and was hoping it had a patch like the one described in Mahout-279, since I want only 10 vectors out of a set of more than 100,000,000.  I have been using canopy clustering for better results, but I still need to do a few passes of kmeans to determine my T, and the random seeding takes a long time.

The comments say that you are working on a kmeans++.  I searched around but couldn't find any more information about it.  Is a scalable kmeans++ in the works? (I know research on the subject is quite new.)

Thanks!



Mattie Whitmore
Mathematician/IR&D Software Engineer
HARRIS  Corporation - Advanced Information Solutions
301.837.5278
mwhitmor@harris.com

Re: Mahout-279/kmeans++

Posted by Ted Dunning <te...@gmail.com>.

One way to test this is to add a small amount of noise to all of your data
points.  This won't be easy from the command line, but is easy from Java.
 You can do this, for instance:

      Vector v = ...;  // read one data point as a vector
      // u holds one random value per component of v
      Vector u = new DenseVector(v.size()).assign(Functions.random());
      // add 0.1 * u to v, in place
      v.assign(u, Functions.plusMult(0.1));
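
For readers without Mahout on the classpath, the same idea can be tried in plain Java. This is only a sketch under stated assumptions: the class and method names (`JitterSketch`, `jitter`, `jitterAll`, `countDistinct`) are made up for illustration, points are plain `double[]` arrays rather than Mahout vectors, and the noise is uniform in [0, scale). It shows that identical points stop being identical once a little noise is added, which is what the test above relies on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Plain-Java illustration (no Mahout types): add a little noise to every
// point so that no two inputs are exactly identical any more.
class JitterSketch {

    // Returns a copy of `point` with uniform noise in [0, scale) added to
    // each component, i.e. point + scale * u with u drawn from [0, 1).
    static double[] jitter(double[] point, double scale, Random rng) {
        double[] out = new double[point.length];
        for (int i = 0; i < point.length; i++) {
            out[i] = point[i] + scale * rng.nextDouble();
        }
        return out;
    }

    // Applies jitter to every point; the input list is left untouched.
    static List<double[]> jitterAll(List<double[]> points, double scale, Random rng) {
        List<double[]> out = new ArrayList<>();
        for (double[] p : points) {
            out.add(jitter(p, scale, rng));
        }
        return out;
    }

    // Counts distinct points by exact component-wise equality.
    static int countDistinct(List<double[]> points) {
        Set<String> seen = new HashSet<>();
        for (double[] p : points) {
            seen.add(Arrays.toString(p));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            points.add(new double[] {1.0, 2.0}); // three identical points
        }
        List<double[]> noisy = jitterAll(points, 0.1, new Random(42));
        System.out.println("distinct before: " + countDistinct(points));
        System.out.println("distinct after:  " + countDistinct(noisy));
    }
}
```

If the number of clustered points changes noticeably after jittering, that would point towards duplicates being involved; if it does not, the cause probably lies elsewhere.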


On Wed, Aug 22, 2012 at 10:40 AM, Whitmore, Mattie <mw...@harris.com>wrote:

> Yes, I have data which is exactly the same.  If I give every vector a name
> which is distinct (albeit the data point is the same as other points in the
> set) will this keep the algorithm from dropping non-distinct vectors/data
> points (which is what I THINK but have yet to verify is what is going on)?
>
> Thanks,
>
> Mattie
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Wednesday, August 22, 2012 1:18 PM
> To: user@mahout.apache.org
> Subject: Re: Mahout-279/kmeans++
>
> Just an off thought, do you have duplicate input points?
>
> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <mwhitmor@harris.com
> >wrote:
>
> > ... I have also verified by running canopy multiple times with 0.5 and 0.7
> > that there is a continual discrepancy between the two clustering versions.
> > The max/min vectors in a cluster using 0.5 is: 19192158/215 and 0.7 is:
> > 921998/5.  They should not necessarily be the same, since I am using canopy
> > clustering to find initial centroids, however I would think they would have
> > the same sum, which they do not (45901885 vs 1599154).
> >
> > Here is the method I am running:
> >
> > public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
> >         throws IOException, InterruptedException, ClassNotFoundException,
> >                InstantiationException, IllegalAccessException {
> >
> >     Configuration conf = new Configuration();
> >
> >     DistanceMeasure measure = new EuclideanDistanceMeasure();
> >
> >     Path vectorsFolder = new Path(outputDir, "vectors");
> >     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
> >     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
> >
> >     // create canopies instead of initial vectors
> >     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
> >             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
> >
> >     // kmeans cluster operation
> >     KMeansDriver.run(conf, vectorsFolder,
> >             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
> >             clusterOutput, measure, 0.01,
> >             Integer.parseInt(itMax), true, 0.0, false);
> >
> >     // post process by putting completed clusters into their own files.
> >     ClusterOutputPostProcessorDriver.run(clusterOutput,
> >             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
> > }
> >
> > What do you think?
> >
> > On another but related note: Is there a plan to have a method -- say
> > ClusterOutputPostProcessorDriver -- which when run outputs the vectors
> > within clusters as well as a separate folder containing pruned outliers?
> >
> > Thanks!
> >
> > Mattie
> >
> > -----Original Message-----
> > From: Paritosh Ranjan [mailto:pranjan@xebia.com]
> > Sent: Friday, August 17, 2012 12:16 PM
> > To: user@mahout.apache.org
> > Subject: Re: Mahout-279/kmeans++
> >
> > The clustering algorithm has also changed internally. So, expect the
> > results to be different ( and better ).
> >
> > I can think of one reason for this behavior. Maybe lots of clusters
> > have only one vector inside them, and, AFAIK, clusterdumper will not
> > output any cluster with a single vector.
> > So, I think it's clusterdumper which is doing the invisible "pruning"
> > (by not outputting clusters with single vectors).
> >
> > Can you cross check the output once with ClusterOutputPostProcessorDriver?
> >
> > No, no tool can output the pruned vectors. The only way to see all
> > vectors assigned to any cluster is to set clusterClassificationThreshold
> > to 0.
> >
> > If you still face the problem, then please provide the parameters with
> > which you are calling kmeans.
> >
> > Regarding "I should also mention I have vectors which are exactly the
> > same (even their names), perhaps they are the ones being pruned, is that
> > possible? "
> >
> > The name of the vector has nothing to do with clustering, I am not sure
> > whether it will have any effect when clusterdumper is in action. So,
> > crosschecking with ClusterOutputPostProcessorDriver will answer this.
> >
> > Good luck.
> > Paritosh
> >
> > On 17-08-2012 21:07, Whitmore, Mattie wrote:
> > > Sure, I have a dataset which I wish to cluster using Kmeans.  Previously
> > > (v0.5) when I did a clusterdump the total number of vectors within the
> > > resultant clusters was the same as the total number fed to the algorithm.
> > > I wish this to be the case when clustering with v0.7.  The only change in
> > > the algorithm is clusterClassificationThreshold; I set this value to 0
> > > so that it will in fact cluster all vectors in the dataset.
> > >
> > > My logic here was no vector should have a probability of being in some
> > > cluster less than 0 and therefore all vectors should cluster.
> > >
> > > However after running a clusterdump I find that vectors (1/3 roughly)
> > > have been pruned.
> > >
> > > Is this a bug, or me just not understanding the new capabilities?
> > >
> > > I should also mention I have vectors which are exactly the same (even
> > > their names), perhaps they are the ones being pruned, is that possible?
> > >
> > > Another question if I may: I will eventually want to use the pruning
> > > capabilities, does the ClusterOutputPostProcessorDriver method (or a
> > > similar method) have the capability of outputting the pruned vectors
> > > into a folder?
> > >
> > > Thanks! Please let me know if I'm still not being clear enough.
> > >
> > > Mattie
> > >
> > > -----Original Message-----
> > > From: Paritosh Ranjan [mailto:pranjan@xebia.com]
> > > Sent: Friday, August 17, 2012 11:20 AM
> > > To: user@mahout.apache.org
> > > Subject: Re: Mahout-279/kmeans++
> > >
> > > clusterClassificationThreshold is for outlier removal, and this is the
> > > way it should be used.
> > >
> > > Can you provide some more information about your job and the way you are
> > > calling it?
> > >
> > > And if I look at the code, the vector should be clustered even if the
> > > pdf is 0. The method which decides whether the vector should be assigned
> > > to a particular cluster or not -
> > >
> > > /**
> > >  * Decides whether the vector should be classified or not based on the max pdf
> > >  * value of the clusters and threshold value.
> > >  *
> > >  * @return whether the vector should be classified or not.
> > >  */
> > > private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
> > >   return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
> > > }
> > >
> > > On 17-08-2012 20:06, Whitmore, Mattie wrote:
> > >
> > >> Hi Ted,
> > >>
> > >> Yes this is great!  I hope to start working with this algorithm in the
> > >> next couple weeks.
> > >>
> > >> I have a question about the 0.7 implementation of kmeans and the
> > >> clusterClassificationThreshold,  I have this value set at zero, but the
> > >> output is still showing that about 1/3 of my data is not assigned to a
> > >> cluster in my output.  Am I using this value incorrectly?  I did a
> > >> kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned despite
> > >> the clusterClassificationThreshold = 0.
> > >>
> > >>
> > >> Thanks,
> > >>
> > >> Mattie
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > >> Sent: Wednesday, August 15, 2012 5:20 PM
> > >> To: user@mahout.apache.org
> > >> Subject: Re: Mahout-279/kmeans++
> > >>
> > >> Mattie,
> > >>
> > >> Would this help?
> > >>
> > >>
> > >> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
> > >>
> > >> and
> > >>
> > >> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
> > >>
> > >> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <mwhitmor@harris.com>
> > >> wrote:
> > >>
> > >>> Hi!
> > >>>
> > >>> I have been using RandomSeedGenerator, and was hoping it had a patch like
> > >>> that described in Mahout-279 since I want only 10 vectors out of a set of
> > >>> more than 100,000,000.  I have been using canopy clustering for better
> > >>> results, but still need to do a few passes of kmeans to determine my T, and
> > >>> the random seed does take a long time.
> > >>>
> > >>> The comments say that you are working on a kmeans++, I searched around but
> > >>> couldn't confirm any more information about it.  Is a scalable kmeans++ in
> > >>> the works? (I know research on the subject is quite new)
> > >>>
> > >>> Thanks!
> > >>>
> > >>>
> > >>>
> > >>> Mattie Whitmore
> > >>> Mathematician/IR&D Software Engineer
> > >>> HARRIS  Corporation - Advanced Information Solutions
> > >>> 301.837.5278
> > >>> mwhitmor@harris.com
> > >>>
> > >>>
> > >>>
> > >>>
> > >
> >
>
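
The threshold check quoted above can be tried in isolation. The following is a minimal plain-Java sketch, not the Mahout source: `ThresholdSketch`, `countClassified` and the sample pdf values are made up for illustration, and a `double[]` stands in for the per-cluster pdf vector. It assumes pdf values are never negative. Under that assumption a threshold of 0 keeps every vector, which matches the reasoning in the quoted messages and suggests the check on its own would not drop a third of the data.

```java
// Plain-Java illustration of the quoted threshold check (no Mahout types).
class ThresholdSketch {

    // Keeps the vector when its largest per-cluster pdf value reaches the
    // threshold, mirroring the comparison in the quoted shouldClassify.
    static boolean shouldClassify(double[] pdfPerCluster, double threshold) {
        double max = Double.NEGATIVE_INFINITY;
        for (double pdf : pdfPerCluster) {
            max = Math.max(max, pdf);
        }
        return max >= threshold;
    }

    // Counts how many rows (one row of pdf values per input vector) pass.
    static int countClassified(double[][] pdfs, double threshold) {
        int kept = 0;
        for (double[] row : pdfs) {
            if (shouldClassify(row, threshold)) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical pdf values for three input vectors and two clusters.
        double[][] pdfs = {
            {0.0, 0.0},   // as far from every cluster as it can get
            {0.2, 0.7},
            {0.05, 0.01},
        };
        System.out.println("threshold 0.0 keeps " + countClassified(pdfs, 0.0));
        System.out.println("threshold 0.1 keeps " + countClassified(pdfs, 0.1));
    }
}
```

A useful sanity check in the same spirit is to compare the count of vectors written by the clustering job with the count fed in, before any dump tool is involved.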

Re: Mahout-279/kmeans++

Posted by Ted Dunning <te...@gmail.com>.

No.  The algorithm works either way.  The algorithm doesn't need the full
capabilities of a matrix since it just makes a few sequential passes
through the data.
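
To make "a sequential pass" concrete, here is a small plain-Java sketch of one centroid update that only iterates over its input once and never needs random access. It is a generic k-means style update, not the BallKmeans code. All names (`WeightedPassSketch`, `WeightedPoint`, `updateCentroids`) are hypothetical, and the per-point weight follows the idea mentioned elsewhere in this thread that a weight can stand for several identical vectors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Generic illustration only; this is not the BallKmeans implementation.
class WeightedPassSketch {

    // Hypothetical holder: one distinct point and the number of identical
    // input vectors it stands for.
    static final class WeightedPoint {
        final double[] values;
        final double weight;

        WeightedPoint(double weight, double... values) {
            this.weight = weight;
            this.values = values;
        }
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the centroid closest to p (ties go to the lower index).
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(p, centroids[c]) < squaredDistance(p, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // One sequential pass: assign each point to its nearest centroid and
    // accumulate weighted sums, then return the weighted means.
    static double[][] updateCentroids(Iterable<WeightedPoint> points, double[][] centroids) {
        int k = centroids.length;
        int dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        double[] totals = new double[k];
        for (WeightedPoint p : points) {
            int c = nearest(p.values, centroids);
            totals[c] += p.weight;
            for (int j = 0; j < dim; j++) {
                sums[c][j] += p.weight * p.values[j];
            }
        }
        double[][] updated = new double[k][dim];
        for (int c = 0; c < k; c++) {
            for (int j = 0; j < dim; j++) {
                // An empty cluster keeps its previous centroid.
                updated[c][j] = totals[c] > 0 ? sums[c][j] / totals[c] : centroids[c][j];
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        List<WeightedPoint> points = new ArrayList<>();
        points.add(new WeightedPoint(3, 0.0, 0.0));   // three copies of (0, 0)
        points.add(new WeightedPoint(1, 1.0, 0.0));
        points.add(new WeightedPoint(2, 10.0, 10.0)); // two copies of (10, 10)
        double[][] updated = updateCentroids(points, new double[][] {{0, 0}, {10, 10}});
        System.out.println(Arrays.deepToString(updated)); // [[0.25, 0.0], [10.0, 10.0]]
    }
}
```

Because the loop only asks for the next point, any `Iterable` works as input, whether it is backed by a matrix, a list, or a file reader.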

On Thu, Aug 30, 2012 at 3:25 PM, Whitmore, Mattie <mw...@harris.com>wrote:

> Would the algorithm be better implemented if it were given a matrix? I'm
> thinking of work done on extending matrix multiplication to tensor
> multiplication, I suppose. That is neither here nor there for this current
> project.

Re: Mahout-279/kmeans++

Posted by Ted Dunning <te...@gmail.com>.

The names are outside the vector or matrix data.  Vectors and matrices
store numbers, not strings.
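
As a rough illustration of keeping the label beside the numbers rather than in a column, and of the suggestion quoted below to build your own Iterable, here is a plain-Java sketch. It does not use Mahout's NamedVector or MatrixSlice; `NamedRow`, `NamedRows` and `nameOfNearest` are hypothetical stand-ins chosen for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Plain-Java illustration: each row carries a name next to its numbers,
// so the numeric data never has to hold a string.
class NamedRowsSketch {

    // Hypothetical stand-in for a named vector.
    static final class NamedRow {
        final String name;
        final double[] values;

        NamedRow(String name, double... values) {
            this.name = name;
            this.values = values;
        }
    }

    // A minimal Iterable of named rows that can be walked sequentially.
    static final class NamedRows implements Iterable<NamedRow> {
        private final List<NamedRow> rows = new ArrayList<>();

        NamedRows add(String name, double... values) {
            rows.add(new NamedRow(name, values));
            return this;
        }

        @Override
        public Iterator<NamedRow> iterator() {
            return rows.iterator();
        }
    }

    // Returns the name of the row closest to `query`, or null if there are
    // no rows. The name survives the numeric work because it travels with
    // the row object.
    static String nameOfNearest(Iterable<NamedRow> rows, double[] query) {
        String bestName = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (NamedRow row : rows) {
            double distance = 0.0;
            for (int i = 0; i < query.length; i++) {
                double diff = row.values[i] - query[i];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                bestName = row.name;
            }
        }
        return bestName;
    }

    public static void main(String[] args) {
        NamedRows rows = new NamedRows()
                .add("a", 0.0, 0.0)
                .add("b", 5.0, 5.0);
        System.out.println(nameOfNearest(rows, new double[] {4.0, 4.0}));
    }
}
```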

On Thu, Aug 30, 2012 at 3:25 PM, Whitmore, Mattie <mw...@harris.com>wrote:

> I was thinking that one column would be the name for each row -- like a
> "name column" for each vector in a matrix.  I probably mistyped somewhere
> in there :).  Would the algorithm be better implemented if it were given a
> matrix? I'm thinking of work done on extending matrix multiplication to
> tensor multiplication, I suppose. That is neither here nor there for this
> current project.
>
> Thanks for the guidance!
>
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Thursday, August 30, 2012 2:52 PM
> To: user@mahout.apache.org
> Subject: Re: Mahout-279/kmeans++
>
> But columns aren't what I would expect you to want labeled.  I think that
> row labels might be nicer.  Happily, each named vector has a name for the
> entire vector as well.
>
> On Thu, Aug 30, 2012 at 2:48 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > The input to the BallKmeans is actually not a matrix.  It is an
> > Iterable<MatrixSlice>.  This can be a matrix since a matrix implements
> > this.
> >
> > So one way to deal with this is to build your own Iterable and put
> > NamedVectors into it.  NamedVectors retain labels as you want.
> >
> >
> > On Thu, Aug 30, 2012 at 12:53 PM, Whitmore, Mattie <mwhitmor@harris.com>
> > wrote:
> >
> >> I need to be using the matrices for BallKmeans.  Can matrices be named?
> >> By this I mean can I assign a column of my matrix to be the "name" of each
> >> row?
> >>
> >> Thanks!
> >>
> >> -----Original Message-----
> >> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> >> Sent: Wednesday, August 29, 2012 12:17 PM
> >> To: user@mahout.apache.org
> >> Subject: Re: Mahout-279/kmeans++
> >>
> >> Yes.  The ball k-means implementation does use weights to indicate
> >> multiple vectors.
> >>
> >> The implementation is definitely ready to test.  I would be slightly
> >> surprised if it has absolutely zero issues, but your feedback on such
> >> issues would help them get fixed much sooner than others.
> >>
> >> On Wed, Aug 29, 2012 at 10:37 AM, Whitmore, Mattie <mwhitmor@harris.com>
> >> wrote:
> >>
> >> > I re-ran the canopy-kmeans analytic, this time with unique names, I lost
> >> > more points in the resulting clusters ( total points in the clusters =
> >> > 745490, vs previously: 1599154 for v0.7 and 45901885 for v0.5).  The total
> >> > number of data points fed into the algorithm is 53365862 -- so even v0.5 is
> >> > missing 14% of the data.
> >> >
> >> > I'm thinking if I weight these dense vectors with a weight equal to the
> >> > number of identical vectors in the set that could work -- Ball Kmeans seems
> >> > to do this.  Is this a correct interpretation of how to use weights in Ball
> >> > Kmeans, and is Ball Kmeans ready enough to be used/tested?
> >> >
> >> > Thanks
> >> >
> >> > -----Original Message-----
> >> > From: Paritosh Ranjan [mailto:pranjan@xebia.com]
> >> > Sent: Thursday, August 23, 2012 12:34 PM
> >> > To: user@mahout.apache.org
> >> > Subject: Re: Mahout-279/kmeans++
> >> >
> >> > clusterDump works in memory, and there are no plans yet to make it
> >> > distributed ( or not in memory ). See this:
> >> > https://issues.apache.org/jira/browse/MAHOUT-940
> >> >
> >> > clusterpp has an option for distributed processing, so you can process any
> >> > amount of data with it.
> >> >
> >> > On 23-08-2012 19:55, Whitmore, Mattie wrote:
> >> > > Yes, unique names will be my next plan -- I just can't kick off that job
> >> > > until after the weekend.  If this makes no difference I will also try the
> >> > > noise idea, and I'll follow up about both.
> >> > >
> >> > > My next question is regarding clusterDump.  Is there a way to run this
> >> > > in parallel? I have found some code to execute in java (the preferable
> >> > > method for me) but I would like the method to be faster and not in memory.
> >> > > Is this a possibility? Or in the works?
> >> > >
> >> > > Thanks!
> >> > >
> >> > > -----Original Message-----
> >> > > From: Paritosh Ranjan [mailto:pranjan@xebia.com]
> >> > > Sent: Wednesday, August 22, 2012 9:09 PM
> >> > > To: user@mahout.apache.org
> >> > > Subject: Re: Mahout-279/kmeans++
> >> > >
> >> > > Can you also try to provide distinct names to vectors and then cluster?
> >> > > It should not have any effect, but would be good to know the behavior.

RE: Mahout-279/kmeans++

Posted by "Whitmore, Mattie" <mw...@harris.com>.

I was thinking that one column would be the name for each row -- like a "name column" for each vector in a matrix.  I probably mistyped somewhere in there :).  Would the algorithm be better implemented if it were given a matrix? I'm thinking of work done on extending matrix multiplication to tensor multiplication, I suppose. That is neither here nor there for this current project.

Thanks for the guidance!


-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Thursday, August 30, 2012 2:52 PM
To: user@mahout.apache.org
Subject: Re: Mahout-279/kmeans++

But columns aren't what I would expect you to want labeled.  I think that
row labels might be nicer.  Happily, each named vector has a name for the
entire vector as well.

On Thu, Aug 30, 2012 at 2:48 PM, Ted Dunning <te...@gmail.com> wrote:

> The input to the BallKmeans is actually not a matrix.  It is an
> Iterable<MatrixSlice>.  This can be a matrix since a matrix implements
> this.
>
> So one way to deal with this is to build your own Iterable and put
> NamedVectors into it.  NamedVectors retain labels as you want.
>
>
> On Thu, Aug 30, 2012 at 12:53 PM, Whitmore, Mattie <mw...@harris.com>wrote:
>
>> I need to be using the matrices for BallKmeans.  Can matrices be named?
>> By this I mean can I assign a column of my matrix to be the "name" of each
>> row?
>>
>> Thanks!
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>> Sent: Wednesday, August 29, 2012 12:17 PM
>> To: user@mahout.apache.org
>> Subject: Re: Mahout-279/kmeans++
>>
>> Yes.  The ball k-means implementation does use weights to indicate
>> multiple vectors.
>>
>> The implementation is definitely ready to test.  I would be slightly
>> surprised if it has absolutely zero issues, but your feedback on such
>> issues would help them get fixed much sooner than others.
>>
>> On Wed, Aug 29, 2012 at 10:37 AM, Whitmore, Mattie <mwhitmor@harris.com>
>> wrote:
>>
>> > I re-ran the canopy-kmeans analytic, this time with unique names, I lost
>> > more points in the resulting clusters ( total points in the clusters =
>> > 745490, vs previously: 1599154 for v0.7 and 45901885 for v0.5).  The total
>> > number of data points fed into the algorithm is 53365862 -- so even v0.5 is
>> > missing 14% of the data.
>> >
>> > I'm thinking if I weight these dense vectors with a weight equal to the
>> > number of identical vectors in the set that could work -- Ball Kmeans seems
>> > to do this.  Is this a correct interpretation of how to use weights in Ball
>> > Kmeans, and is Ball Kmeans ready enough to be used/tested?
>> >
>> > Thanks
>> >
>> > -----Original Message-----
>> > From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>> > Sent: Thursday, August 23, 2012 12:34 PM
>> > To: user@mahout.apache.org
>> > Subject: Re: Mahout-279/kmeans++
>> >
>> > clusterDump works in memory, and there are no plans yet to make it
>> > distributed ( or not in memory ). See this:
>> > https://issues.apache.org/jira/browse/MAHOUT-940
>> >
>> > clusterpp has an option for distributed processing, so you can process any
>> > amount of data with it.
>> >
>> > On 23-08-2012 19:55, Whitmore, Mattie wrote:
>> > > Yes, unique names will be my next plan -- I just can't kick off that job
>> > > until after the weekend.  If this makes no difference I will also try the
>> > > noise idea, and I'll follow up about both.
>> > >
>> > > My next question is regarding clusterDump.  Is there a way to run this
>> > > in parallel? I have found some code to execute in java (the preferable
>> > > method for me) but I would like the method to be faster and not in memory.
>> > > Is this a possibility? Or in the works?
>> > >
>> > > Thanks!
>> > >
>> > > -----Original Message-----
>> > > From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>> > > Sent: Wednesday, August 22, 2012 9:09 PM
>> > > To: user@mahout.apache.org
>> > > Subject: Re: Mahout-279/kmeans++
>> > >
>> > > Can you also try to provide distinct names to vectors and then cluster?
>> > > It should not have any effect, but would be good to know the behavior.
>> > >
>> > > On 22-08-2012 23:10, Whitmore, Mattie wrote:
>> > >> Yes, I have data which is exactly the same.  If I give every vector a
>> > >> name which is distinct (albeit the data point is the same as other points
>> > >> in the set) will this keep the algorithm from dropping non-distinct
>> > >> vectors/data points (which is what I THINK but have yet to verify is what
>> > >> is going on)?
>> > >>
>> > >> Thanks,
>> > >>
>> > >> Mattie
>> > >>
>> > >> -----Original Message-----
>> > >> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>> > >> Sent: Wednesday, August 22, 2012 1:18 PM
>> > >> To: user@mahout.apache.org
>> > >> Subject: Re: Mahout-279/kmeans++
>> > >>
>> > >> Just an off thought, do you have duplicate input points?
>> > >>
>> > >> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <mwhitmor@harris.com>
>> > >> wrote:
>> > >>
>> > >>> ... I have also verified by running canopy multiple times with 0.5 and 0.7
>> > >>> that there is a continual discrepancy between the two clustering versions.
>> > >>> The max/min vectors in a cluster using 0.5 is: 19192158/215 and 0.7 is:
>> > >>> 921998/5.  They should not necessarily be the same, since I am using canopy
>> > >>> clustering to find initial centroids, however I would think they would have
>> > >>> the same sum, which they do not (45901885 vs 1599154).
>> > >>>
>> > >>> Here is the method I am running:
>> > >>>
>> > >>> public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
>> > >>>         throws IOException, InterruptedException, ClassNotFoundException,
>> > >>>         InstantiationException, IllegalAccessException {
>> > >>>
>> > >>>     Configuration conf = new Configuration();
>> > >>>
>> > >>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>> > >>>
>> > >>>     Path vectorsFolder = new Path(outputDir, "vectors");
>> > >>>     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
>> > >>>     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
>> > >>>
>> > >>>     // create canopies instead of initial vectors
>> > >>>     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
>> > >>>             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
>> > >>>
>> > >>>     // kmeans cluster operation
>> > >>>     KMeansDriver.run(conf, vectorsFolder,
>> > >>>             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
>> > >>>             clusterOutput, measure, 0.01,
>> > >>>             Integer.parseInt(itMax), true, 0.0, false);
>> > >>>
>> > >>>     // post process by putting completed clusters into their own files.
>> > >>>     ClusterOutputPostProcessorDriver.run(clusterOutput,
>> > >>>             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
>> > >>> }
>> > >>>
>> > >>> What do you think?
>> > >>>
>> > >>> On another but related note: Is there a plan to have a method -- say
>> > >>> ClusterOutputPostProcessorDriver -- which when run outputs the
>> vectors
>> > >>> within clusters as well as a separate folder containing pruned
>> > outliers?
>> > >>>
>> > >>> Thanks!
>> > >>>
>> > >>> Mattie
>> > >>>
>> > >>> -----Original Message-----
>> > >>> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>> > >>> Sent: Friday, August 17, 2012 12:16 PM
>> > >>> To: user@mahout.apache.org
>> > >>> Subject: Re: Mahout-279/kmeans++
>> > >>>
>> > >>> The clustering algorithm has also changed internally. So, expect the
>> > >>> results to be different ( and better ).
>> > >>>
>> > >>> I can think of one reason for this behavior. Maybe lots of clusters
>> > >>> have only one vector in them and, AFAIK, clusterdumper will not
>> > >>> output any cluster with a single vector.
>> > >>> So I think it's clusterdumper that is doing the invisible "pruning"
>> > >>> (by not outputting clusters with single vectors).
>> > >>>
>> > >>> Can you cross check the output once with
>> > ClusterOutputPostProcessorDriver?
>> > >>>
>> > >>> No, no tool can output the pruned vectors. The only way to see all
>> > >>> vectors assigned to any cluster is to set
>> > clusterClassificationThreshold
>> > >>> to 0.
>> > >>>
>> > >>> If you still face the problem, then please provide the parameters
>> with
>> > >>> which you are calling kmeans.
>> > >>>
>> > >>> Regarding "I should also mention I have vectors which are exactly
>> the
>> > >>> same (even their names), perhaps they are the ones being pruned, is
>> > that
>> > >>> possible? "
>> > >>>
>> > >>> The name of the vector has nothing to do with clustering, I am not
>> sure
>> > >>> whether it will have any effect when clusterdumper is in action. So,
>> > >>> crosschecking with ClusterOutputPostProcessorDriver will answer
>> this.
>> > >>>
>> > >>> Good luck.
>> > >>> Paritosh
>> > >>>
>> > >>> On 17-08-2012 21:07, Whitmore, Mattie wrote:
>> > >>>> Sure, I have a dataset which I wish to cluster using Kmeans.
>> >  Previously
>> > >>> (v0.5) when I did a clusterdump the total amount of vectors within
>> the
>> > >>> resultant clusters was the same as the total amount fed to the
>> > algorithm.
>> > >>>    I wish this to be the case when clustering with v0.7.  The only
>> > change in
>> > >>> the algorithm is clusterClassificationThreshold,  I set this value
>> to
>> > be 0
>> > >>> so that it will in fact cluster all vectors in the dataset.
>> > >>>> My logic here was no vector should have a probability of being in
>> some
>> > >>> cluster less than 0 and therefore all vectors should cluster.
>> > >>>> However after running a clusterdump I find that vectors (1/3
>> roughly)
>> > >>> have been pruned.
>> > >>>> Is this a bug, or me just not understanding the new capabilities?
>> > >>>>
>> > >>>> I should also mention I have vectors which are exactly the same
>> (even
>> > >>> their names), perhaps they are the ones being pruned, is that
>> possible?
>> > >>>> Another question if I may: I will eventually want to use the
>> pruning
>> > >>> capabilities, does the ClusterOutputPostProcessorDriver method (or a
>> > >>> similar method) have the capability of outputting the pruned vectors
>> > into a
>> > >>> folder?
>> > >>>> Thanks! Please let me know if I'm still not being clear enough.
>> > >>>>
>> > >>>> Mattie
>> > >>>>
>> > >>>> -----Original Message-----
>> > >>>> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>> > >>>> Sent: Friday, August 17, 2012 11:20 AM
>> > >>>> To: user@mahout.apache.org
>> > >>>> Subject: Re: Mahout-279/kmeans++
>> > >>>>
>> > >>>> clusterClassificationThreshold is for outlier removal, and this is
>> the
>> > >>> way it should be used.
>> > >>>> Can you provide some more information about your job and the way
>> you
>> > are
>> > >>> calling it?
>> > >>>> And if I look at the code, the vector should be clustered even if
>> the
>> > >>> pdf is 0. The method which decides whether the vector should be
>> > assigned to
>> > >>> a particular cluster or not -
>> > >>>> /**
>> > >>>>  * Decides whether the vector should be classified or not based on the max pdf
>> > >>>>  * value of the clusters and threshold value.
>> > >>>>  *
>> > >>>>  * @return whether the vector should be classified or not.
>> > >>>>  */
>> > >>>> private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
>> > >>>>   return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
>> > >>>> }
>> > >>>>
>> > >>>> On 17-08-2012 20:06, Whitmore, Mattie wrote:
>> > >>>>
>> > >>>>> Hi Ted,
>> > >>>>>
>> > >>>>> Yes this is great!  I hope to start working with this algorithm in
>> > the
>> > >>> next couple weeks.
>> > >>>>> I have a question about the 0.7 implementation of kmeans and the
>> > >>> clusterClassificationThreshold,  I have this value set at zero, but
>> the
>> > >>> output is still showing that about 1/3 of my data is not assigned
>> to a
>> > >>> cluster in my output.  Am I using this value incorrectly?  I did a
>> > >>> kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned
>> > despite
>> > >>> the clusterClassificationThreshold = 0.
>> > >>>>> Thanks,
>> > >>>>>
>> > >>>>> Mattie
>> > >>>>>
>> > >>>>>
>> > >>>>> -----Original Message-----
>> > >>>>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>> > >>>>> Sent: Wednesday, August 15, 2012 5:20 PM
>> > >>>>> To: user@mahout.apache.org
>> > >>>>> Subject: Re: Mahout-279/kmeans++
>> > >>>>>
>> > >>>>> Mattie,
>> > >>>>>
>> > >>>>> Would this help?
>> > >>>>>
>> > >>>>> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
>> > >>>>>
>> > >>>>> and
>> > >>>>>
>> > >>>>> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
>> > >>>>>
>> > >>>>> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <
>> > mwhitmor@harris.com
>> > >>>> wrote:
>> > >>>>>> Hi!
>> > >>>>>>
>> > >>>>>> I have been using RandomSeedGenerator, and was hoping it had a
>> patch
>> > >>> like
>> > >>>>>> that described in Mahout-279 since I want only 10 vectors out of
>> a
>> > set
>> > >>> of
>> > >>>>>> more than 100,000,000.  I have been using canopy clustering for
>> > better
>> > >>>>>> results, but still need to do a few passes of kmeans to
>> determine my
>> > >>> T, and
>> > >>>>>> the random seed does take a long time.
>> > >>>>>>
>> > >>>>>> The comments say that you are working on a kmeans++, I searched
>> > around
>> > >>> but
>> > >>>>>> couldn't confirm any more information about it.  Is a scalable
>> > >>> kmeans++ in
>> > >>>>>> the works? (I know research on the subject is quite new)
>> > >>>>>>
>> > >>>>>> Thanks!
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> Mattie Whitmore
>> > >>>>>> Mathematician/IR&D Software Engineer
>> > >>>>>> HARRIS  Corporation - Advanced Information Solutions
>> > >>>>>> 301.837.5278
>> > >>>>>> mwhitmor@harris.com<ma...@harris.com>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >
>> >
>> >
>> >
>>
>
>
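
A minimal sketch for checking the duplicate-point question raised in the quoted thread above, assuming the input can be loaded as plain double[] rows. DuplicateWeights and countDuplicates are hypothetical names for illustration and are not part of Mahout. It counts how often each exact coordinate tuple occurs, so the number of distinct points can be compared with the total fed to the clustering job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Mahout class.
class DuplicateWeights {

    // Count how many times each exact coordinate tuple occurs.
    // Keys compare by value, so two rows match only if every coordinate is equal.
    static Map<List<Double>, Integer> countDuplicates(List<double[]> points) {
        Map<List<Double>, Integer> counts = new LinkedHashMap<>();
        for (double[] p : points) {
            List<Double> key = new ArrayList<>(p.length);
            for (double x : p) {
                key.add(x);
            }
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy data: the first two rows are exact duplicates.
        List<double[]> points = Arrays.asList(
            new double[] {1.0, 2.0},
            new double[] {1.0, 2.0},
            new double[] {3.0, 4.0});

        Map<List<Double>, Integer> counts = countDuplicates(points);
        int total = 0;
        for (Map.Entry<List<Double>, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " occurs " + e.getValue() + " time(s)");
            total += e.getValue();
        }
        System.out.println("distinct=" + counts.size() + " total=" + total);
    }
}
```

If the distinct count is far below the total, duplicates are a plausible suspect for the missing vectors. The per-point counts could also serve as weights in an algorithm that accepts weighted points. This keeps everything in memory, so at the scale discussed here it would only be practical on a sample.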

Re: Mahout-279/kmeans++

Posted by Ted Dunning <te...@gmail.com>.

But columns aren't what I would expect you to want labeled.  I think that
row labels might be nicer.  Happily, each named vector has a name for the
entire vector as well.

On Thu, Aug 30, 2012 at 2:48 PM, Ted Dunning <te...@gmail.com> wrote:

> The input to the BallKmeans is actually not a matrix.  It is an
> Iterable<MatrixSlice>.  This can be a matrix since a matrix implements
> this.
>
> So one way to deal with this is to build your own Iterable and put
> NamedVectors into it.  NamedVectors retain labels as you want.
>
>
> On Thu, Aug 30, 2012 at 12:53 PM, Whitmore, Mattie <mw...@harris.com>wrote:
>
>> I need to be using the matrices for BallKmeans.  Can matrices be named?
>> By this I mean can I assign a column of my matrix to be the "name" of each
>> row?
>>
>> Thanks!
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>> Sent: Wednesday, August 29, 2012 12:17 PM
>> To: user@mahout.apache.org
>> Subject: Re: Mahout-279/kmeans++
>>
>> Yes.  The ball k-means implementation does use weights to indicate
>> multiple
>> vectors.
>>
>> The implementation is definitely ready to test.  I would be slightly
>> surprised if it has absolutely zero issues, but your feedback on such
>> issues would help them get fixed much sooner than others.
>>
>> On Wed, Aug 29, 2012 at 10:37 AM, Whitmore, Mattie <mwhitmor@harris.com
>> >wrote:
>>
>> > I re-ran the canopy-kmeans analytic, this time with unique names, I lost
>> > more points in the resulting clusters ( total points in the clusters =
>> > 745490, vs previously: 1599154 for v0.7 and 45901885 for v0.5).  The
>> total
>> > number of data points fed into the algorithm is 53365862 -- so even
>> v0.5 is
>> > missing 14% of the data.
>> >
>> > I'm thinking if I weight these dense vectors with a weight equal to the
>> > number of identical vectors in the set that could work -- Ball Kmeans
>> seems
>> > to do this.  Is this a correct interpretation of how to use weights in
>> Ball
>> > Kmeans, and is Ball Kmeans ready enough to be used/tested?
>> >
>> > Thanks
>> >
>> > -----Original Message-----
>> > From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>> > Sent: Thursday, August 23, 2012 12:34 PM
>> > To: user@mahout.apache.org
>> > Subject: Re: Mahout-279/kmeans++
>> >
>> > clusterDump works in memory, and there are no plans yet to make it
>> > distributed ( or not in memory ). See this:
>> > https://issues.apache.org/jira/browse/MAHOUT-940
>> >
>> > clusterpp has an option for distributed processing, so you can process
>> any
>> > amount of data with it.
>> >

Re: Mahout-279/kmeans++

Posted by Ted Dunning <te...@gmail.com>.

The input to the BallKmeans is actually not a matrix.  It is an
Iterable<MatrixSlice>.  This can be a matrix since a matrix implements
this.

So one way to deal with this is to build your own Iterable and put
NamedVectors into it.  NamedVectors retain labels as you want.
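
A minimal sketch of the "build your own Iterable" idea, written with plain Java stand-ins rather than Mahout classes so that it runs on its own. LabeledRows and Row are hypothetical names. Per the message above, the real BallKmeans input is an Iterable<MatrixSlice> carrying NamedVectors, which this sketch does not use or depend on. It only shows that a label attached to each row survives iteration, with no need for a "name" column inside the matrix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical container, not a Mahout class.
class LabeledRows implements Iterable<LabeledRows.Row> {

    // Stand-in for a named vector: a label plus the row's values.
    static final class Row {
        final String name;
        final double[] values;

        Row(String name, double[] values) {
            this.name = name;
            this.values = values;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    // Append one labeled row; returns this so calls can be chained.
    LabeledRows add(String name, double... values) {
        rows.add(new Row(name, values));
        return this;
    }

    @Override
    public Iterator<Row> iterator() {
        return rows.iterator();
    }

    public static void main(String[] args) {
        LabeledRows data = new LabeledRows()
            .add("doc-a", 1.0, 2.0)
            .add("doc-b", 3.0, 4.0);

        // Any consumer that only needs an Iterable sees each label with its row.
        for (Row r : data) {
            System.out.println(r.name + " -> " + Arrays.toString(r.values));
        }
    }
}
```

The label lives beside the numeric values rather than among them, so it cannot affect any distance computation.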

On Thu, Aug 30, 2012 at 12:53 PM, Whitmore, Mattie <mw...@harris.com>wrote:

> I need to be using the matrices for BallKmeans.  Can matrices be named? By
> this I mean can I assign a column of my matrix to be the "name" of each row?
>
> Thanks!
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Wednesday, August 29, 2012 12:17 PM
> To: user@mahout.apache.org
> Subject: Re: Mahout-279/kmeans++
>
> Yes.  The ball k-means implementation does use weights to indicate multiple
> vectors.
>
> The implementation is definitely ready to test.  I would be slightly
> surprised if it has absolutely zero issues, but your feedback on such
> issues would help them get fixed much sooner than others.
>
> On Wed, Aug 29, 2012 at 10:37 AM, Whitmore, Mattie <mwhitmor@harris.com
> >wrote:
>
> > I re-ran the canopy-kmeans analytic, this time with unique names, I lost
> > more points in the resulting clusters ( total points in the clusters =
> > 745490, vs previously: 1599154 for v0.7 and 45901885 for v0.5).  The
> total
> > number of data points fed into the algorithm is 53365862 -- so even v0.5
> is
> > missing 14% of the data.
> >
> > I'm thinking if I weight these dense vectors with a weight equal to the
> > number of identical vectors in the set that could work -- Ball Kmeans
> seems
> > to do this.  Is this a correct interpretation of how to use weights in
> Ball
> > Kmeans, and is Ball Kmeans ready enough to be used/tested?
> >
> > Thanks
> >
> > -----Original Message-----
> > From: Paritosh Ranjan [mailto:pranjan@xebia.com]
> > Sent: Thursday, August 23, 2012 12:34 PM
> > To: user@mahout.apache.org
> > Subject: Re: Mahout-279/kmeans++
> >
> > clusterDump works in memory, and there are no plans yet to make it
> > distributed ( or not in memory ). See this:
> > https://issues.apache.org/jira/browse/MAHOUT-940
> >
> > clusterpp has an option for distributed processing, so you can process
> any
> > amount of data with it.
> >
> > On 23-08-2012 19:55, Whitmore, Mattie wrote:
> > > Yes, unique names will be my next plan -- I just can't kick off that
> job
> > until after the weekend.  If this makes no difference I will also try the
> > noise idea, and I'll follow up about both.
> > >
> > > My next question is regarding clusterDump.  Is there a way to run this
> > in parallel? I have found some code to execute in java (the preferable
> > method for me) but I would like the method to be faster and not in
> memory.
> >  Is this a possibility? Or in the works?
> > >
> > > Thanks!
> > >
> > > -----Original Message-----
> > > From: Paritosh Ranjan [mailto:pranjan@xebia.com]
> > > Sent: Wednesday, August 22, 2012 9:09 PM
> > > To: user@mahout.apache.org
> > > Subject: Re: Mahout-279/kmeans++
> > >
> > > Can you also try to provide distinct names to vectors and then cluster?
> > > It should not have any effect, but would be good to know the behavior.
> > >
> > > On 22-08-2012 23:10, Whitmore, Mattie wrote:
> > >> Yes, I have data which is exactly the same.  If I give every vector a
> > name which is distinct (albeit the data point is the same as other points
> > in the set) will this keep the algorithm from dropping non-distinct
> > vectors/data points (which is what I THINK but have yet to verify is what
> > is going on)?
> > >>
> > >> Thanks,
> > >>
> > >> Mattie
> > >>
> > >> -----Original Message-----
> > >> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > >> Sent: Wednesday, August 22, 2012 1:18 PM
> > >> To: user@mahout.apache.org
> > >> Subject: Re: Mahout-279/kmeans++
> > >>
> > >> Just an off thought, do you have duplicate input points?
> > >>
> > >> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <
> mwhitmor@harris.com
> > >wrote:
> > >>
> > >>> ... I have also verified by running canopy multiple times with 0.5
> and
> > 0.7
> > >>> that there is a continual discrepancy between the two clustering
> > versions.
> > >>>    The max/min vectors in a cluster using 0.5 is: 19192158/215  and
> > 0.7 is:
> > >>> 921998/5.  They should not necessarily be the same, since I am using
> > canopy
> > >>> clustering to find initial centroids, however I would think they
> would
> > have
> > >>> the same sum, which they do not (45901885 vs 1599154).
> > >>>
> > >>> Here is the method I am running:
> > >>>
> > >>> public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
> > >>>         throws IOException, InterruptedException, ClassNotFoundException,
> > >>>         InstantiationException, IllegalAccessException {
> > >>>
> > >>>     Configuration conf = new Configuration();
> > >>>
> > >>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
> > >>>
> > >>>     Path vectorsFolder = new Path(outputDir, "vectors");
> > >>>     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
> > >>>     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
> > >>>
> > >>>     // create canopies instead of initial vectors
> > >>>     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
> > >>>             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
> > >>>
> > >>>     // kmeans cluster operation
> > >>>     KMeansDriver.run(conf, vectorsFolder,
> > >>>             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
> > >>>             clusterOutput, measure, 0.01,
> > >>>             Integer.parseInt(itMax), true, 0.0, false);
> > >>>
> > >>>     // post process by putting completed clusters into their own files.
> > >>>     ClusterOutputPostProcessorDriver.run(clusterOutput,
> > >>>             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
> > >>> }
> > >>>
> > >>> What do you think?
> > >>>
> > >>> On another but related note: Is there a plan to have a method -- say
> > >>> ClusterOutputPostProcessorDriver -- which when run outputs the
> vectors
> > >>> within clusters as well as a separate folder containing pruned
> > outliers?
> > >>>
> > >>> Thanks!
> > >>>
> > >>> Mattie
> > >>>
> > >>> -----Original Message-----
> > >>> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
> > >>> Sent: Friday, August 17, 2012 12:16 PM
> > >>> To: user@mahout.apache.org
> > >>> Subject: Re: Mahout-279/kmeans++
> > >>>
> > >>> The clustering algorithm has also changed internally. So, expect the
> > >>> results to be different ( and better ).
> > >>>
> > >>> I can think of one reason for this behavior. Maybe lots of clusters
> > >>> have only one vector in them and, AFAIK, clusterdumper will not
> > >>> output any cluster with a single vector.
> > >>> So I think it's clusterdumper that is doing the invisible "pruning"
> > >>> (by not outputting clusters with single vectors).
> > >>>
> > >>> Can you cross check the output once with
> > ClusterOutputPostProcessorDriver?
> > >>>
> > >>> No, no tool can output the pruned vectors. The only way to see all
> > >>> vectors assigned to any cluster is to set
> > clusterClassificationThreshold
> > >>> to 0.
> > >>>
> > >>> If you still face the problem, then please provide the parameters
> with
> > >>> which you are calling kmeans.
> > >>>
> > >>> Regarding "I should also mention I have vectors which are exactly the
> > >>> same (even their names), perhaps they are the ones being pruned, is
> > that
> > >>> possible? "
> > >>>
> > >>> The name of the vector has nothing to do with clustering, I am not
> sure
> > >>> whether it will have any effect when clusterdumper is in action. So,
> > >>> crosschecking with ClusterOutputPostProcessorDriver will answer this.
> > >>>
> > >>> Good luck.
> > >>> Paritosh
> > >>>
> > >>> On 17-08-2012 21:07, Whitmore, Mattie wrote:
> > >>>> Sure, I have a dataset which I wish to cluster using Kmeans.
> >  Previously
> > >>> (v0.5) when I did a clusterdump the total amount of vectors within
> the
> > >>> resultant clusters was the same as the total amount fed to the
> > algorithm.
> > >>>    I wish this to be the case when clustering with v0.7.  The only
> > change in
> > >>> the algorithm is clusterClassificationThreshold,  I set this value to
> > be 0
> > >>> so that it will in fact cluster all vectors in the dataset.
> > >>>> My logic here was no vector should have a probability of being in
> some
> > >>> cluster less than 0 and therefore all vectors should cluster.
> > >>>> However after running a clusterdump I find that vectors (1/3
> roughly)
> > >>> have been pruned.
> > >>>> Is this a bug, or me just not understanding the new capabilities?
> > >>>>
> > >>>> I should also mention I have vectors which are exactly the same
> (even
> > >>> their names), perhaps they are the ones being pruned, is that
> possible?
> > >>>> Another question if I may: I will eventually want to use the pruning
> > >>> capabilities, does the ClusterOutputPostProcessorDriver method (or a
> > >>> similar method) have the capability of outputting the pruned vectors
> > into a
> > >>> folder?
> > >>>> Thanks! Please let me know if I'm still not being clear enough.
> > >>>>
> > >>>> Mattie
> > >>>>
> > >>>> -----Original Message-----
> > >>>> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
> > >>>> Sent: Friday, August 17, 2012 11:20 AM
> > >>>> To: user@mahout.apache.org
> > >>>> Subject: Re: Mahout-279/kmeans++
> > >>>>
> > >>>> clusterClassificationThreshold is for outlier removal, and this is
> the
> > >>> way it should be used.
> > >>>> Can you provide some more information about your job and the way you
> > are
> > >>> calling it?
> > >>>> And if I look at the code, the vector should be clustered even if
> the
> > >>> pdf is 0. The method which decides whether the vector should be
> > assigned to
> > >>> a particular cluster or not -
> > >>>> /**
> > >>>>        * Decides whether the vector should be classified or not
> based
> > on
> > >>> the max pdf
> > >>>>        * value of the clusters and threshold value.
> > >>>>        *
> > >>>>        * @return whether the vector should be classified or not.
> > >>>>        */
> > >>>>       private static boolean shouldClassify(Vector pdfPerCluster,
> > Double
> > >>> clusterClassificationThreshold) {
> > >>>>         return pdfPerCluster.maxValue() >=
> > clusterClassificationThreshold;
> > >>>>       }
> > >>>>
> > >>>> On 17-08-2012 20:06, Whitmore, Mattie wrote:
> > >>>>
> > >>>>> Hi Ted,
> > >>>>>
> > >>>>> Yes this is great!  I hope to start working with this algorithm in
> > the
> > >>> next couple weeks.
> > >>>>> I have a question about the 0.7 implementation of kmeans and the
> > >>> clusterClassificationThreshold,  I have this value set at zero, but
> the
> > >>> output is still showing that about 1/3 of my data is not assigned to
> a
> > >>> cluster in my output.  Am I using this value incorrectly?  I did a
> > >>> kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned
> > despite
> > >>> the clusterClassificationThreshold = 0.
> > >>>>> Thanks,
> > >>>>>
> > >>>>> Mattie
> > >>>>>
> > >>>>>
> > >>>>> -----Original Message-----
> > >>>>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > >>>>> Sent: Wednesday, August 15, 2012 5:20 PM
> > >>>>> To: user@mahout.apache.org
> > >>>>> Subject: Re: Mahout-279/kmeans++
> > >>>>>
> > >>>>> Mattie,
> > >>>>>
> > >>>>> Would this help?
> > >>>>>
> > >>>>>
> > >>>
> >
> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
> > >>>>> and
> > >>>>>
> > >>>>>
> > >>>
> >
> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
> > >>>>> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <
> > mwhitmor@harris.com
> > >>>> wrote:
> > >>>>>> Hi!
> > >>>>>>
> > >>>>>> I have been using RandomSeedGenerator, and was hoping it had a
> patch
> > >>> like
> > >>>>>> that described in Mahout-279 since I want only 10 vectors out of a
> > set
> > >>> of
> > >>>>>> more than 100,000,000.  I have been using canopy clustering for
> > better
> > >>>>>> results, but still need to do a few passes of kmeans to determine
> my
> > >>> T, and
> > >>>>>> the random seed does take a long time.
> > >>>>>>
> > >>>>>> The comments say that you are working on a kmeans++, I searched
> > around
> > >>> but
> > >>>>>> couldn't confirm any more information about it.  Is a scalable
> > >>> kmeans++ in
> > >>>>>> the works? (I know research on the subject is quite new)
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> Mattie Whitmore
> > >>>>>> Mathematician/IR&D Software Engineer
> > >>>>>> HARRIS  Corporation - Advanced Information Solutions
> > >>>>>> 301.837.5278
> > >>>>>> mwhitmor@harris.com<ma...@harris.com>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >
> >
> >
> >
>

RE: Mahout-279/kmeans++

Posted by "Whitmore, Mattie" <mw...@harris.com>.

I need to use matrices for BallKmeans.  Can matrix rows be named? By this I mean, can I assign a column of my matrix to be the "name" of each row?

Thanks!

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Wednesday, August 29, 2012 12:17 PM
To: user@mahout.apache.org
Subject: Re: Mahout-279/kmeans++

Yes.  The ball k-means implementation does use weights to indicate multiple
vectors.

The implementation is definitely ready to test.  I would be slightly
surprised if it has absolutely zero issues, but your feedback on any such
issues would help them get fixed much sooner than they otherwise would be.


Re: Mahout-279/kmeans++

Posted by Ted Dunning <te...@gmail.com>.

Yes.  The ball k-means implementation does use weights to indicate multiple
vectors.

The implementation is definitely ready to test.  I would be slightly
surprised if it has absolutely zero issues, but your feedback on any such
issues would help them get fixed much sooner than they otherwise would be.
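
A minimal sketch of what "weights indicate multiple vectors" means for a centroid update, assuming plain double[] points rather than Mahout vectors. The class and method names here are hypothetical and this is not code from BallKmeans; it only illustrates that a point with weight w contributes to the mean exactly as w identical copies would.

```java
final class WeightedCentroid {

    /**
     * Weighted mean of the given points. A point with weight w counts
     * the same as w identical unweighted points.
     */
    static double[] centroid(double[][] points, double[] weights) {
        if (points.length == 0 || points.length != weights.length) {
            throw new IllegalArgumentException("need one weight per point and at least one point");
        }
        int dim = points[0].length;
        double[] sum = new double[dim];
        double totalWeight = 0.0;
        for (int i = 0; i < points.length; i++) {
            if (points[i].length != dim) {
                throw new IllegalArgumentException("all points must have the same dimension");
            }
            for (int j = 0; j < dim; j++) {
                sum[j] += weights[i] * points[i][j];
            }
            totalWeight += weights[i];
        }
        if (totalWeight <= 0.0) {
            throw new IllegalArgumentException("total weight must be positive");
        }
        for (int j = 0; j < dim; j++) {
            sum[j] /= totalWeight;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Three copies of (0, 0) collapsed into one point of weight 3, plus one (4, 8).
        double[][] points = {{0.0, 0.0}, {4.0, 8.0}};
        double[] weights = {3.0, 1.0};
        System.out.println(java.util.Arrays.toString(centroid(points, weights)));
    }
}
```

The design point is that no input is dropped: duplicates shrink the number of stored vectors, but the total weight still equals the original number of data points.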


RE: Mahout-279/kmeans++

Posted by "Whitmore, Mattie" <mw...@harris.com>.

I re-ran the canopy-kmeans analytic, this time with unique names, and lost even more points in the resulting clusters (total points in the clusters = 745490, versus 1599154 previously for v0.7 and 45901885 for v0.5).  The total number of data points fed into the algorithm is 53365862, so even v0.5 is missing 14% of the data.

I'm thinking that giving each distinct dense vector a weight equal to the number of identical vectors in the set could work; Ball Kmeans seems to do this.  Is this a correct interpretation of how to use weights in Ball Kmeans, and is Ball Kmeans ready enough to be used/tested?
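
The weighting idea above can be sketched roughly as follows. This is only an illustration in plain Java, assuming points are held as double[] arrays in memory; DuplicateWeights, WeightedPoint and collapse are hypothetical names, not part of Mahout or BallKmeans, and a data set of this size would need the same counting done as a distributed job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DuplicateWeights {

    /** A distinct point together with the number of times it occurred in the input. */
    static final class WeightedPoint {
        final double[] values;
        final int weight;

        WeightedPoint(double[] values, int weight) {
            this.values = values;
            this.weight = weight;
        }
    }

    /** Collapses exact duplicates into one weighted point each, in first-seen order. */
    static List<WeightedPoint> collapse(List<double[]> points) {
        // double[] uses identity-based equals/hashCode, so key on a List<Double> copy.
        Map<List<Double>, Integer> counts = new LinkedHashMap<>();
        for (double[] p : points) {
            List<Double> key = new ArrayList<>(p.length);
            for (double x : p) {
                key.add(x);
            }
            counts.merge(key, 1, Integer::sum);
        }
        List<WeightedPoint> out = new ArrayList<>();
        for (Map.Entry<List<Double>, Integer> e : counts.entrySet()) {
            double[] v = new double[e.getKey().size()];
            for (int i = 0; i < v.length; i++) {
                v[i] = e.getKey().get(i);
            }
            out.add(new WeightedPoint(v, e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[]{1.0, 2.0}, new double[]{3.0, 4.0}, new double[]{1.0, 2.0});
        for (WeightedPoint wp : collapse(points)) {
            System.out.println(Arrays.toString(wp.values) + " weight=" + wp.weight);
        }
    }
}
```

A useful sanity check with this approach is that the weights sum to the original number of input points, which is the quantity the clusterdump totals above fail to preserve.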

Thanks

-----Original Message-----
From: Paritosh Ranjan [mailto:pranjan@xebia.com] 
Sent: Thursday, August 23, 2012 12:34 PM
To: user@mahout.apache.org
Subject: Re: Mahout-279/kmeans++

clusterDump works in memory, and there are no plans yet to make it distributed (or not in memory). See this: https://issues.apache.org/jira/browse/MAHOUT-940

clusterpp has an option for distributed processing, so you can process any amount of data with it.

On 23-08-2012 19:55, Whitmore, Mattie wrote:
> Yes, unique names will be my next plan -- I just can't kick off that job until after the weekend.  If this makes no difference I will also try the noise idea, and I'll follow up about both.
>
> My next question is regarding clusterDump.  Is there a way to run this in parallel? I have found some code to run it from Java (my preferred method), but I would like it to be faster and not in memory.  Is this a possibility, or in the works?
>
> Thanks!
>
> -----Original Message-----
> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
> Sent: Wednesday, August 22, 2012 9:09 PM
> To: user@mahout.apache.org
> Subject: Re: Mahout-279/kmeans++
>
> Can you also try to provide distinct names to vectors and then cluster?
> It should not have any effect, but it would be good to know the behavior.
>
> On 22-08-2012 23:10, Whitmore, Mattie wrote:
>> Yes, I have data which is exactly the same.  If I give every vector a name which is distinct (albeit the data point is the same as other points in the set) will this keep the algorithm from dropping non-distinct vectors/data points (which is what I THINK but have yet to verify is what is going on)?
>>
>> Thanks,
>>
>> Mattie
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>> Sent: Wednesday, August 22, 2012 1:18 PM
>> To: user@mahout.apache.org
>> Subject: Re: Mahout-279/kmeans++
>>
>> Just an off thought, do you have duplicate input points?
>>
>> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <mw...@harris.com> wrote:
>>
>>> ... I have also verified by running canopy multiple times with 0.5 and 0.7
>>> that there is a continual discrepancy between the two clustering versions.
>>>    The max/min vectors in a cluster using 0.5 is: 19192158/215  and 0.7 is:
>>> 921998/5.  They should not necessarily be the same, since I am using canopy
>>> clustering to find initial centroids, however I would think they would have
>>> the same sum, which they do not (45901885 vs 1599154).
>>>
>>> Here is the method I am running:
>>>
>>> public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
>>>         throws IOException, InterruptedException, ClassNotFoundException,
>>>                InstantiationException, IllegalAccessException {
>>>
>>>     Configuration conf = new Configuration();
>>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>
>>>     Path vectorsFolder = new Path(outputDir, "vectors");
>>>     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
>>>     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
>>>
>>>     // create canopies instead of initial vectors
>>>     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
>>>             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
>>>
>>>     // kmeans cluster operation
>>>     KMeansDriver.run(conf, vectorsFolder,
>>>             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
>>>             clusterOutput, measure, 0.01,
>>>             Integer.parseInt(itMax), true, 0.0, false);
>>>
>>>     // post process by putting completed clusters into their own files
>>>     ClusterOutputPostProcessorDriver.run(clusterOutput,
>>>             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
>>> }
>>>
>>> What do you think?
>>>
>>> On another but related note: Is there a plan to have a method -- say
>>> ClusterOutputPostProcessorDriver -- which when run outputs the vectors
>>> within clusters as well as a separate folder containing pruned outliers?
>>>
>>> Thanks!
>>>
>>> Mattie
>>>
>>> -----Original Message-----
>>> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>>> Sent: Friday, August 17, 2012 12:16 PM
>>> To: user@mahout.apache.org
>>> Subject: Re: Mahout-279/kmeans++
>>>
>>> The clustering algorithm has also changed internally. So, expect the
>>> results to be different ( and better ).
>>>
>>> I can think of one reason for this behavior. Maybe lots of clusters are
>>> having only one vector inside it, and, AFAIK, clusterdumper will not
>>> output any cluster with single vector.
>>> So, I think, it's clusterdumper which is doing the invisible "pruning" (
>>> by not outputting clusters with single vectors ).
>>>
>>> Can you cross check the output once with ClusterOutputPostProcessorDriver?
>>>
>>> No, no tool can output the pruned vectors. The only way to see all
>>> vectors assigned to any cluster is to set clusterClassificationThreshold
>>> to 0.
>>>
>>> If you still face the problem, then please provide the parameters with
>>> which you are calling kmeans.
>>>
>>> Regarding "I should also mention I have vectors which are exactly the
>>> same (even their names), perhaps they are the ones being pruned, is that
>>> possible? "
>>>
>>> The name of the vector has nothing to do with clustering, I am not sure
>>> whether it will have any effect when clusterdumper is in action. So,
>>> crosschecking with ClusterOutputPostProcessorDriver will answer this.
>>>
>>> Good luck.
>>> Paritosh
>>>
>>> On 17-08-2012 21:07, Whitmore, Mattie wrote:
>>>> Sure, I have a dataset which I wish to cluster using Kmeans.  Previously
>>> (v0.5) when I did a clusterdump the total amount of vectors within the
>>> resultant clusters was the same as the total amount fed to the algorithm.
>>>    I wish this to be the case when clustering with v0.7.  The only change in
>>> the algorithm is clusterClassificationThreshold,  I set this value to be 0
>>> so that it will in fact cluster all vectors in the dataset.
>>>> My logic here was no vector should have a probability of being in some
>>> cluster less than 0 and therefore all vectors should cluster.
>>>> However after running a clusterdump I find that vectors (1/3 roughly)
>>> have been pruned.
>>>> Is this a bug, or me just not understanding the new capabilities?
>>>>
>>>> I should also mention I have vectors which are exactly the same (even
>>> their names), perhaps they are the ones being pruned, is that possible?
>>>> Another question if I may: I will eventually want to use the pruning
>>> capabilities, does the ClusterOutputPostProcessorDriver method (or a
>>> similar method) have the capability of outputting the pruned vectors into a
>>> folder?
>>>> Thanks! Please let me know if I'm still not being clear enough.
>>>>
>>>> Mattie
>>>>
>>>> -----Original Message-----
>>>> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>>>> Sent: Friday, August 17, 2012 11:20 AM
>>>> To: user@mahout.apache.org
>>>> Subject: Re: Mahout-279/kmeans++
>>>>
>>>> clusterClassificationThreshold is for outlier removal, and this is the
>>> way it should be used.
>>>> Can you provide some more information about your job and the way you are
>>> calling it?
>>>> And if I look at the code, the vector should be clustered even if the
>>>> pdf is 0. The method which decides whether the vector should be assigned to
>>>> a particular cluster or not -
>>>> /**
>>>>  * Decides whether the vector should be classified or not based on the max pdf
>>>>  * value of the clusters and threshold value.
>>>>  *
>>>>  * @return whether the vector should be classified or not.
>>>>  */
>>>> private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
>>>>   return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
>>>> }
>>>>
>>>> On 17-08-2012 20:06, Whitmore, Mattie wrote:
>>>>
>>>>> Hi Ted,
>>>>>
>>>>> Yes this is great!  I hope to start working with this algorithm in the
>>> next couple weeks.
>>>>> I have a question about the 0.7 implementation of kmeans and the
>>> clusterClassificationThreshold,  I have this value set at zero, but the
>>> output is still showing that about 1/3 of my data is not assigned to a
>>> cluster in my output.  Am I using this value incorrectly?  I did a
>>> kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned despite
>>> the clusterClassificationThreshold = 0.
>>>>> Thanks,
>>>>>
>>>>> Mattie
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>>>>> Sent: Wednesday, August 15, 2012 5:20 PM
>>>>> To: user@mahout.apache.org
>>>>> Subject: Re: Mahout-279/kmeans++
>>>>>
>>>>> Mattie,
>>>>>
>>>>> Would this help?
>>>>>
>>>>>
>>> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
>>>>> and
>>>>>
>>>>>
>>> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
>>>>> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <mwhitmor@harris.com
>>>> wrote:
>>>>>> Hi!
>>>>>>
>>>>>> I have been using RandomSeedGenerator, and was hoping it had a patch
>>> like
>>>>>> that described in Mahout-279 since I want only 10 vectors out of a set
>>> of
>>>>>> more than 100,000,000.  I have been using canopy clustering for better
>>>>>> results, but still need to do a few passes of kmeans to determine my
>>> T, and
>>>>>> the random seed does take a long time.
>>>>>>
>>>>>> The comments say that you are working on a kmeans++, I searched around
>>> but
>>>>>> couldn't confirm any more information about it.  Is a scalable
>>> kmeans++ in
>>>>>> the works? (I know research on the subject is quite new)
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>> Mattie Whitmore
>>>>>> Mathematician/IR&D Software Engineer
>>>>>> HARRIS  Corporation - Advanced Information Solutions
>>>>>> 301.837.5278
>>>>>> mwhitmor@harris.com<ma...@harris.com>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>

Re: Mahout-279/kmeans++

Posted by Paritosh Ranjan <pr...@xebia.com>.

clusterDump works in memory, and there are no plans yet to make it distributed ( or not in memory ). See this: https://issues.apache.org/jira/browse/MAHOUT-940

clusterpp has an option for distributed processing, so you can process any amount of data with it.
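
If the goal is only to check that every input vector ended up in some cluster, a single streaming pass that keeps one counter per cluster may be enough, with no need to hold whole clusters in memory. Below is a minimal sketch of that idea in plain Java. It assumes the cluster assignments have already been exported as text lines of the hypothetical form "clusterId<TAB>vectorName"; that format, the class name ClusterSizeTally, and its methods are illustrative assumptions and not part of Mahout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

public class ClusterSizeTally {

    // Reads "clusterId<TAB>vectorName" lines one at a time and keeps only
    // one counter per cluster, so memory grows with the number of clusters,
    // not with the number of vectors.
    public static Map<String, Long> tally(BufferedReader in) throws IOException {
        Map<String, Long> counts = new TreeMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            int tab = line.indexOf('\t');
            // Hypothetical format: everything before the first tab is the cluster id.
            String clusterId = tab < 0 ? line : line.substring(0, tab);
            counts.merge(clusterId, 1L, Long::sum);
        }
        return counts;
    }

    // Sum of all cluster sizes; compare this with the number of input vectors.
    public static long total(Map<String, Long> counts) {
        long sum = 0;
        for (long c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        String sample = "12\tdocA\n12\tdocB\n7\tdocC\n";
        Map<String, Long> counts = tally(new BufferedReader(new StringReader(sample)));
        System.out.println(counts + " total=" + total(counts));
    }
}
```

Comparing the total with the input size would show whether vectors are missing from the clustering output itself or only from the dump.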


RE: Mahout-279/kmeans++

Posted by "Whitmore, Mattie" <mw...@harris.com>.

Yes, unique names will be my next plan -- I just can't kick off that job until after the weekend.  If this makes no difference I will also try the noise idea, and I'll follow up about both.

My next question is regarding clusterDump.  Is there a way to run this in parallel? I have found some code to execute in Java (the preferable method for me), but I would like the method to be faster and not in memory.  Is this a possibility? Or in the works?

Thanks!

-----Original Message-----
From: Paritosh Ranjan [mailto:pranjan@xebia.com] 
Sent: Wednesday, August 22, 2012 9:09 PM
To: user@mahout.apache.org
Subject: Re: Mahout-279/kmeans++

Can you also try to provide distinct names to vectors and then cluster?
It should not have any effect, but would be good to know the behavior.

On 22-08-2012 23:10, Whitmore, Mattie wrote:
> Yes, I have data which is exactly the same.  If I give every vector a name which is distinct (albeit the data point is the same as other points in the set) will this keep the algorithm from dropping non-distinct vectors/data points (which is what I THINK but have yet to verify is what is going on)?
>
> Thanks,
>
> Mattie
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Wednesday, August 22, 2012 1:18 PM
> To: user@mahout.apache.org
> Subject: Re: Mahout-279/kmeans++
>
> Just an off thought, do you have duplicate input points?
>
> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <mw...@harris.com>wrote:
>
>> ... I have also verified by running canopy multiple times with 0.5 and 0.7
>> that there is a continual discrepancy between the two clustering versions.
>>   The max/min vectors in a cluster using 0.5 is: 19192158/215  and 0.7 is:
>> 921998/5.  They should not necessarily be the same, since I am using canopy
>> clustering to find initial centroids, however I would think they would have
>> the same sum, which they do not (45901885 vs 1599154).
>>
>> Here is the method I am running:
>>
>> public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
>>         throws IOException, InterruptedException, ClassNotFoundException,
>>                InstantiationException, IllegalAccessException {
>>
>>     Configuration conf = new Configuration();
>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>>
>>     Path vectorsFolder = new Path(outputDir, "vectors");
>>     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
>>     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
>>
>>     // create canopies instead of initial vectors
>>     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
>>             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
>>
>>     // kmeans cluster operation
>>     KMeansDriver.run(conf, vectorsFolder,
>>             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
>>             clusterOutput, measure, 0.01,
>>             Integer.parseInt(itMax), true, 0.0, false);
>>
>>     // post process by putting completed clusters into their own files
>>     ClusterOutputPostProcessorDriver.run(clusterOutput,
>>             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
>> }
>>
>> What do you think?
>>
>> On another but related note: Is there a plan to have a method -- say
>> ClusterOutputPostProcessorDriver -- which when run outputs the vectors
>> within clusters as well as a separate folder containing pruned outliers?
>>
>> Thanks!
>>
>> Mattie
>>
>> -----Original Message-----
>> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>> Sent: Friday, August 17, 2012 12:16 PM
>> To: user@mahout.apache.org
>> Subject: Re: Mahout-279/kmeans++
>>
>> The clustering algorithm has also changed internally. So, expect the
>> results to be different ( and better ).
>>
>> I can think of one reason for this behavior. Maybe lots of clusters are
>> having only one vector inside it, and, AFAIK, clusterdumper will not
>> output any cluster with single vector.
>> So, I think, it's clusterdumper which is doing the invisible "pruning" (
>> by not outputting clusters with single vectors ).
>>
>> Can you cross check the output once with ClusterOutputPostProcessorDriver?
>>
>> No, no tool can output the pruned vectors. The only way to see all
>> vectors assigned to any cluster is to set clusterClassificationThreshold
>> to 0.
>>
>> If you still face the problem, then please provide the parameters with
>> which you are calling kmeans.
>>
>> Regarding "I should also mention I have vectors which are exactly the
>> same (even their names), perhaps they are the ones being pruned, is that
>> possible? "
>>
>> The name of the vector has nothing to do with clustering, I am not sure
>> whether it will have any effect when clusterdumper is in action. So,
>> crosschecking with ClusterOutputPostProcessorDriver will answer this.
>>
>> Good luck.
>> Paritosh
>>
>> On 17-08-2012 21:07, Whitmore, Mattie wrote:
>>> Sure, I have a dataset which I wish to cluster using Kmeans.  Previously
>> (v0.5) when I did a clusterdump the total amount of vectors within the
>> resultant clusters was the same as the total amount fed to the algorithm.
>>   I wish this to be the case when clustering with v0.7.  The only change in
>> the algorithm is clusterClassificationThreshold,  I set this value to be 0
>> so that it will in fact cluster all vectors in the dataset.
>>> My logic here was no vector should have a probability of being in some
>> cluster less than 0 and therefore all vectors should cluster.
>>> However after running a clusterdump I find that vectors (1/3 roughly)
>> have been pruned.
>>> Is this a bug, or me just not understanding the new capabilities?
>>>
>>> I should also mention I have vectors which are exactly the same (even
>> their names), perhaps they are the ones being pruned, is that possible?
>>> Another question if I may: I will eventually want to use the pruning
>> capabilities, does the ClusterOutputPostProcessorDriver method (or a
>> similar method) have the capability of outputting the pruned vectors into a
>> folder?
>>> Thanks! Please let me know if I'm still not being clear enough.
>>>
>>> Mattie
>>>
>>> -----Original Message-----
>>> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>>> Sent: Friday, August 17, 2012 11:20 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: Mahout-279/kmeans++
>>>
>>> clusterClassificationThreshold is for outlier removal, and this is the
>> way it should be used.
>>> Can you provide some more information about your job and the way you are
>> calling it?
>>> And if I look at the code, the vector should be clustered even if the
>>> pdf is 0. The method which decides whether the vector should be assigned to
>>> a particular cluster or not -
>>> /**
>>>  * Decides whether the vector should be classified or not based on the max pdf
>>>  * value of the clusters and threshold value.
>>>  *
>>>  * @return whether the vector should be classified or not.
>>>  */
>>> private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
>>>   return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
>>> }
>>>
>>> On 17-08-2012 20:06, Whitmore, Mattie wrote:
>>>
>>>> Hi Ted,
>>>>
>>>> Yes this is great!  I hope to start working with this algorithm in the
>> next couple weeks.
>>>> I have a question about the 0.7 implementation of kmeans and the
>> clusterClassificationThreshold,  I have this value set at zero, but the
>> output is still showing that about 1/3 of my data is not assigned to a
>> cluster in my output.  Am I using this value incorrectly?  I did a
>> kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned despite
>> the clusterClassificationThreshold = 0.
>>>>
>>>> Thanks,
>>>>
>>>> Mattie
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>>>> Sent: Wednesday, August 15, 2012 5:20 PM
>>>> To: user@mahout.apache.org
>>>> Subject: Re: Mahout-279/kmeans++
>>>>
>>>> Mattie,
>>>>
>>>> Would this help?
>>>>
>>>>
>> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
>>>> and
>>>>
>>>>
>> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
>>>> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <mwhitmor@harris.com
>>> wrote:
>>>>> Hi!
>>>>>
>>>>> I have been using RandomSeedGenerator, and was hoping it had a patch
>> like
>>>>> that described in Mahout-279 since I want only 10 vectors out of a set
>> of
>>>>> more than 100,000,000.  I have been using canopy clustering for better
>>>>> results, but still need to do a few passes of kmeans to determine my
>> T, and
>>>>> the random seed does take a long time.
>>>>>
>>>>> The comments say that you are working on a kmeans++, I searched around
>> but
>>>>> couldn't confirm any more information about it.  Is a scalable
>> kmeans++ in
>>>>> the works? (I know research on the subject is quite new)
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>> Mattie Whitmore
>>>>> Mathematician/IR&D Software Engineer
>>>>> HARRIS  Corporation - Advanced Information Solutions
>>>>> 301.837.5278
>>>>> mwhitmor@harris.com<ma...@harris.com>
>>>>>
>>>>>
>>>>>
>>>>>

Re: Mahout-279/kmeans++

Posted by Paritosh Ranjan <pr...@xebia.com>.

Can you also try to provide distinct names to vectors and then cluster?
It should not have any effect, but would be good to know the behavior.
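
To make the distinct-names experiment concrete, here is a minimal sketch, in plain Java rather than Mahout code, of why names could matter if some stage of a pipeline stored vectors keyed by name. Nothing in it is taken from Mahout: the class DistinctNamesSketch and the helpers makeDistinct and countKeptByName are hypothetical, and whether any Mahout tool actually behaves like countKeptByName is exactly what the experiment would show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DistinctNamesSketch {

    // Appends a running index to every name so that no two records share one.
    public static List<String> makeDistinct(List<String> names) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            out.add(names.get(i) + "#" + i);
        }
        return out;
    }

    // Hypothetical consumer that stores one entry per name: records with a
    // repeated name overwrite each other, so fewer records survive.
    public static int countKeptByName(List<String> names) {
        Map<String, Integer> byName = new HashMap<>();
        for (int i = 0; i < names.size(); i++) {
            byName.put(names.get(i), i);
        }
        return byName.size();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a", "a", "b");
        // Duplicate names collapse in a name-keyed store...
        System.out.println(countKeptByName(names));
        // ...while distinct names keep every record.
        System.out.println(countKeptByName(makeDistinct(names)));
    }
}
```

In Mahout itself, a name is attached by wrapping each vector in a NamedVector (for example, new NamedVector(vector, name)) before the vectors are written out, so the renaming step above would happen where the input vectors are created.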

On 22-08-2012 23:10, Whitmore, Mattie wrote:
> Yes, I have data which is exactly the same.  If I give every vector a name which is distinct (albeit the data point is the same as other points in the set) will this keep the algorithm from dropping non-distinct vectors/data points (which is what I THINK but have yet to verify is what is going on)?
>
> Thanks,
>
> Mattie
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Wednesday, August 22, 2012 1:18 PM
> To: user@mahout.apache.org
> Subject: Re: Mahout-279/kmeans++
>
> Just an off thought, do you have duplicate input points?
>
> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <mw...@harris.com>wrote:
>
>> ... I have also verified by running canopy multiple times with 0.5 and 0.7
>> that there is a continual discrepancy between the two clustering versions.
>>   The max/min vectors in a cluster using 0.5 is: 19192158/215  and 0.7 is:
>> 921998/5.  They should not necessarily be the same, since I am using canopy
>> clustering to find initial centroids, however I would think they would have
>> the same sum, which they do not (45901885 vs 1599154).
>>
>> Here is the method I am running:
>>
>> public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
>>         throws IOException, InterruptedException, ClassNotFoundException,
>>                InstantiationException, IllegalAccessException {
>>
>>     Configuration conf = new Configuration();
>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>>
>>     Path vectorsFolder = new Path(outputDir, "vectors");
>>     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
>>     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
>>
>>     // create canopies instead of initial vectors
>>     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
>>             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
>>
>>     // kmeans cluster operation
>>     KMeansDriver.run(conf, vectorsFolder,
>>             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
>>             clusterOutput, measure, 0.01,
>>             Integer.parseInt(itMax), true, 0.0, false);
>>
>>     // post process by putting completed clusters into their own files
>>     ClusterOutputPostProcessorDriver.run(clusterOutput,
>>             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
>> }
>>
>> What do you think?
>>
>> On another but related note: Is there a plan to have a method -- say
>> ClusterOutputPostProcessorDriver -- which when run outputs the vectors
>> within clusters as well as a separate folder containing pruned outliers?
>>
>> Thanks!
>>
>> Mattie
>>
>> -----Original Message-----
>> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>> Sent: Friday, August 17, 2012 12:16 PM
>> To: user@mahout.apache.org
>> Subject: Re: Mahout-279/kmeans++
>>
>> The clustering algorithm has also changed internally. So, expect the
>> results to be different ( and better ).
>>
>> I can think of one reason for this behavior. Maybe lots of clusters are
>> having only one vector inside it, and, AFAIK, clusterdumper will not
>> output any cluster with single vector.
>> So, I think, it's clusterdumper which is doing the invisible "pruning" (
>> by not outputting clusters with single vectors ).
>>
>> Can you cross check the output once with ClusterOutputPostProcessorDriver?
>>
>> No, no tool can output the pruned vectors. The only way to see all
>> vectors assigned to any cluster is to set clusterClassificationThreshold
>> to 0.
>>
>> If you still face the problem, then please provide the parameters with
>> which you are calling kmeans.
>>
>> Regarding "I should also mention I have vectors which are exactly the
>> same (even their names), perhaps they are the ones being pruned, is that
>> possible? "
>>
>> The name of the vector has nothing to do with clustering, but I am not sure
>> whether it will have any effect when clusterdumper is in action. So,
>> crosschecking with ClusterOutputPostProcessorDriver will answer this.
>>
>> Good luck.
>> Paritosh
>>
>> On 17-08-2012 21:07, Whitmore, Mattie wrote:
>>> Sure, I have a dataset which I wish to cluster using Kmeans.  Previously
>>> (v0.5) when I did a clusterdump the total amount of vectors within the
>>> resultant clusters was the same as the total amount fed to the algorithm.
>>> I wish this to be the case when clustering with v0.7.  The only change in
>>> the algorithm is clusterClassificationThreshold,  I set this value to be 0
>>> so that it will in fact cluster all vectors in the dataset.
>>>
>>> My logic here was no vector should have a probability of being in some
>>> cluster less than 0 and therefore all vectors should cluster.
>>>
>>> However after running a clusterdump I find that vectors (1/3 roughly)
>>> have been pruned.
>>>
>>> Is this a bug, or me just not understanding the new capabilities?
>>>
>>> I should also mention I have vectors which are exactly the same (even
>>> their names), perhaps they are the ones being pruned, is that possible?
>>>
>>> Another question if I may: I will eventually want to use the pruning
>>> capabilities, does the ClusterOutputPostProcessorDriver method (or a
>>> similar method) have the capability of outputting the pruned vectors into a
>>> folder?
>>>
>>> Thanks! Please let me know if I'm still not being clear enough.
>>>
>>> Mattie
>>>
>>> -----Original Message-----
>>> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>>> Sent: Friday, August 17, 2012 11:20 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: Mahout-279/kmeans++
>>>
>>> clusterClassificationThreshold is for outlier removal, and this is the
>>> way it should be used.
>>>
>>> Can you provide some more information about your job and the way you are
>>> calling it?
>>>
>>> And if I look at the code, the vector should be clustered even if the
>>> pdf is 0. The method which decides whether the vector should be assigned to
>>> a particular cluster or not -
>>>
>>>     /**
>>>      * Decides whether the vector should be classified or not based on
>>>      * the max pdf value of the clusters and threshold value.
>>>      *
>>>      * @return whether the vector should be classified or not.
>>>      */
>>>     private static boolean shouldClassify(Vector pdfPerCluster,
>>>         Double clusterClassificationThreshold) {
>>>       return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
>>>     }
>>>
>>> On 17-08-2012 20:06, Whitmore, Mattie wrote:
>>>
>>>> Hi Ted,
>>>>
>>>> Yes this is great!  I hope to start working with this algorithm in the
>>>> next couple weeks.
>>>>
>>>> I have a question about the 0.7 implementation of kmeans and the
>>>> clusterClassificationThreshold,  I have this value set at zero, but the
>>>> output is still showing that about 1/3 of my data is not assigned to a
>>>> cluster in my output.  Am I using this value incorrectly?  I did a
>>>> kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned despite
>>>> the clusterClassificationThreshold = 0.
>>>>
>>>> Thanks,
>>>>
>>>> Mattie
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>>>> Sent: Wednesday, August 15, 2012 5:20 PM
>>>> To: user@mahout.apache.org
>>>> Subject: Re: Mahout-279/kmeans++
>>>>
>>>> Mattie,
>>>>
>>>> Would this help?
>>>>
>>>> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
>>>>
>>>> and
>>>>
>>>> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
>>>>
>>>> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <mwhitmor@harris.com> wrote:
>>>>> Hi!
>>>>>
>>>>> I have been using RandomSeedGenerator, and was hoping it had a patch like
>>>>> that described in Mahout-279 since I want only 10 vectors out of a set of
>>>>> more than 100,000,000.  I have been using canopy clustering for better
>>>>> results, but still need to do a few passes of kmeans to determine my T, and
>>>>> the random seed does take a long time.
>>>>>
>>>>> The comments say that you are working on a kmeans++, I searched around but
>>>>> couldn't confirm any more information about it.  Is a scalable kmeans++ in
>>>>> the works? (I know research on the subject is quite new)
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>> Mattie Whitmore
>>>>> Mathematician/IR&D Software Engineer
>>>>> HARRIS  Corporation - Advanced Information Solutions
>>>>> 301.837.5278
>>>>> mwhitmor@harris.com<ma...@harris.com>
>>>>>
>>>>>
>>>>>
>>>>>

RE: Mahout-279/kmeans++

Posted by "Whitmore, Mattie" <mw...@harris.com>.

Yes, some of my data points are exactly the same.  If I give every vector a distinct name (even though the data point itself is the same as other points in the set), will this keep the algorithm from dropping non-distinct vectors/data points (which is what I THINK is going on, but have yet to verify)?

Thanks,

Mattie

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Wednesday, August 22, 2012 1:18 PM
To: user@mahout.apache.org
Subject: Re: Mahout-279/kmeans++

Just an offhand thought: do you have duplicate input points?


Re: Mahout-279/kmeans++

Posted by Ted Dunning <te...@gmail.com>.

Just an offhand thought: do you have duplicate input points?
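
One way to answer this on a small sample of the data is to count exact repeats directly. The following is only a minimal sketch in plain Java, assuming the sample fits in memory as double arrays; the class and method names are hypothetical and are not part of Mahout.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not Mahout API.
final class DuplicatePointCheck {

    // Returns how many points are exact repeats of an earlier point.
    static int countDuplicates(double[][] points) {
        Map<List<Double>, Integer> seen = new HashMap<>();
        int duplicates = 0;
        for (double[] point : points) {
            // A List<Double> compares element by element, so it can serve as a map key.
            List<Double> key = new ArrayList<>(point.length);
            for (double x : point) {
                key.add(x);
            }
            // merge() returns the updated occurrence count for this point.
            if (seen.merge(key, 1, Integer::sum) > 1) {
                duplicates++;
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Made-up sample: the point (1.0, 2.0) appears three times.
        double[][] sample = { {1.0, 2.0}, {1.0, 2.0}, {3.0, 4.0}, {1.0, 2.0} };
        System.out.println("duplicates: " + countDuplicates(sample));
    }
}
```

If the count is far from zero, duplicates are at least a plausible suspect for a drop in the number of clustered vectors; the sketch says nothing about how Mahout itself treats them.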

On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <mw...@harris.com>wrote:

> ... I have also verified, by running canopy multiple times with 0.5 and 0.7,
> that there is a consistent discrepancy between the two versions.  The
> max/min number of vectors in a cluster is 19192158/215 with 0.5 and 921998/5
> with 0.7.  These need not be the same, since I am using canopy clustering to
> find the initial centroids; however, I would expect the totals to match, and
> they do not (45901885 vs 1599154).
>
> Here is the method I am running:
>
> public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
>                 throws IOException, InterruptedException, ClassNotFoundException,
>                 InstantiationException, IllegalAccessException {
>
>         Configuration conf = new Configuration();
>
>         DistanceMeasure measure = new EuclideanDistanceMeasure();
>
>         Path vectorsFolder = new Path(outputDir, "vectors");
>         Path clusterCenters = new Path(outputDir + "-canopy/centriods");
>         Path clusterOutput = new Path(outputDir + "-canopy/clusters");
>
>         // create canopies instead of initial vectors
>         CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
>                         Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
>
>         // kmeans cluster operation
>         KMeansDriver.run(conf, vectorsFolder,
>                         new Path(clusterCenters, "clusters-0-final/part-r-00000"),
>                         clusterOutput, measure, 0.01,
>                         Integer.parseInt(itMax), true, 0.0, false);
>
>         // post process by putting completed clusters into their own files.
>         ClusterOutputPostProcessorDriver.run(clusterOutput,
>                         new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
>
> }
>
> What do you think?
>
> On another but related note: Is there a plan to have a method -- say
> ClusterOutputPostProcessorDriver -- which when run outputs the vectors
> within clusters as well as a separate folder containing pruned outliers?
>
> Thanks!
>
> Mattie
>

RE: Mahout-279/kmeans++

Posted by "Whitmore, Mattie" <mw...@harris.com>.

I did cross check with ClusterOutputPostProcessorDriver, and the output files contain the same number of vectors that clusterdumper counts.

I have also verified, by running canopy multiple times with 0.5 and 0.7, that there is a consistent discrepancy between the two versions.  The max/min number of vectors in a cluster is 19192158/215 with 0.5 and 921998/5 with 0.7.  These need not be the same, since I am using canopy clustering to find the initial centroids; however, I would expect the totals to match, and they do not (45901885 vs 1599154).
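
The bookkeeping behind that comparison can be sketched in plain Java: sum the per-cluster vector counts and compare the total against the number of input vectors. This is only an illustration under stated assumptions. The class and method names are hypothetical (not Mahout API), and the counts in main() are made-up values, not the figures from my runs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not Mahout API.
final class ClusterCountCheck {

    // Sums the number of vectors reported for each cluster.
    static long totalClustered(Map<Integer, Long> vectorsPerCluster) {
        long total = 0;
        for (long size : vectorsPerCluster.values()) {
            total += size;
        }
        return total;
    }

    // Number of input vectors that do not show up in any cluster.
    static long missing(long inputCount, Map<Integer, Long> vectorsPerCluster) {
        return inputCount - totalClustered(vectorsPerCluster);
    }

    public static void main(String[] args) {
        // Made-up counts keyed by cluster id.
        Map<Integer, Long> sizes = new LinkedHashMap<>();
        sizes.put(0, 5L);
        sizes.put(1, 3L);
        sizes.put(2, 2L);
        System.out.println("clustered: " + totalClustered(sizes));
        System.out.println("missing:   " + missing(15L, sizes));
    }
}
```

A non-zero "missing" value means some input vectors were not assigned to any cluster in the output being counted, whatever the cause.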

Here is the method I am running:

public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
		throws IOException, InterruptedException, ClassNotFoundException,
		InstantiationException, IllegalAccessException {

	Configuration conf = new Configuration();

	DistanceMeasure measure = new EuclideanDistanceMeasure();

	Path vectorsFolder = new Path(outputDir, "vectors");
	Path clusterCenters = new Path(outputDir + "-canopy/centriods");
	Path clusterOutput = new Path(outputDir + "-canopy/clusters");

	// create canopies instead of initial vectors
	// (T1 and T2 are both set to T; runClustering = false,
	// clusterClassificationThreshold = 0, runSequential = false)
	CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
			Double.parseDouble(T), Double.parseDouble(T), false, 0, false);

	// kmeans cluster operation, seeded with the canopy centers
	// (convergenceDelta = 0.01, maxIterations = itMax, runClustering = true,
	// clusterClassificationThreshold = 0.0, runSequential = false)
	KMeansDriver.run(conf, vectorsFolder,
			new Path(clusterCenters, "clusters-0-final/part-r-00000"),
			clusterOutput, measure, 0.01,
			Integer.parseInt(itMax), true, 0.0, false);

	// post process by putting completed clusters into their own files
	// (last argument is runSequential = false)
	ClusterOutputPostProcessorDriver.run(clusterOutput,
			new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);

}

What do you think?

On another but related note: Is there a plan to have a method -- say ClusterOutputPostProcessorDriver -- which when run outputs the vectors within clusters as well as a separate folder containing pruned outliers?

Thanks!

Mattie

-----Original Message-----
From: Paritosh Ranjan [mailto:pranjan@xebia.com] 
Sent: Friday, August 17, 2012 12:16 PM
To: user@mahout.apache.org
Subject: Re: Mahout-279/kmeans++

The clustering algorithm has also changed internally. So, expect the 
results to be different (and better).

I can think of one reason for this behavior. Maybe lots of clusters 
contain only one vector, and, AFAIK, clusterdumper will not 
output any cluster with a single vector.
So, I think it's clusterdumper which is doing the invisible "pruning" 
(by not outputting clusters with single vectors).

Can you cross check the output once with ClusterOutputPostProcessorDriver?

No, no tool can output the pruned vectors. The only way to see all 
vectors assigned to any cluster is to set clusterClassificationThreshold 
to 0.

If you still face the problem, then please provide the parameters with 
which you are calling kmeans.

Regarding "I should also mention I have vectors which are exactly the 
same (even their names), perhaps they are the ones being pruned, is that 
possible? "

The name of the vector has nothing to do with clustering, but I am not sure 
whether it will have any effect when clusterdumper is in action. So, 
crosschecking with ClusterOutputPostProcessorDriver will answer this.

Good luck.
Paritosh


Re: Mahout-279/kmeans++

Posted by Paritosh Ranjan <pr...@xebia.com>.

The clustering algorithm has also changed internally. So, expect the 
results to be different (and better).

I can think of one reason for this behavior. Maybe lots of clusters 
contain only one vector, and, AFAIK, clusterdumper will not 
output any cluster with a single vector.
So, I think it's clusterdumper which is doing the invisible "pruning" 
(by not outputting clusters with single vectors).
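
To see how much a rule like that could hide, one can count the clusters that hold exactly one vector. Below is a minimal sketch in plain Java, assuming the cluster id assigned to each vector is already available as an int array. The names are hypothetical, and this is not how clusterdumper itself is implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not Mahout API.
final class SingletonClusterCheck {

    // Builds a map from cluster id to the number of vectors assigned to it.
    static Map<Integer, Integer> clusterSizes(int[] clusterIdPerVector) {
        Map<Integer, Integer> sizes = new HashMap<>();
        for (int clusterId : clusterIdPerVector) {
            sizes.merge(clusterId, 1, Integer::sum);
        }
        return sizes;
    }

    // Counts clusters containing exactly one vector. Each such cluster
    // accounts for one vector that would vanish if singletons were skipped.
    static int countSingletons(int[] clusterIdPerVector) {
        int singletons = 0;
        for (int size : clusterSizes(clusterIdPerVector).values()) {
            if (size == 1) {
                singletons++;
            }
        }
        return singletons;
    }

    public static void main(String[] args) {
        // Made-up assignments: cluster 7 has three vectors, clusters 3 and 9 have one each.
        int[] assignments = {7, 7, 3, 9, 7};
        System.out.println("singleton clusters: " + countSingletons(assignments));
    }
}
```

If the number of singleton clusters is close to the number of vectors that appear to be missing, the explanation above fits; if it is much smaller, something else is dropping them.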

Can you cross check the output once with ClusterOutputPostProcessorDriver?

No, no tool can output the pruned vectors. The only way to see all 
vectors assigned to any cluster is to set clusterClassificationThreshold 
to 0.
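
The reason a threshold of 0 should keep everything can be shown with a stand-alone version of the shouldClassify check quoted further down in this mail. This is a sketch only: it uses a plain double array in place of a Mahout Vector, the class name is hypothetical, and it assumes the pdf values are non-negative numbers.

```java
// Hypothetical stand-in for illustration only; not Mahout API.
final class ThresholdSketch {

    // Mirrors the quoted check: keep the vector if its best pdf value
    // reaches the threshold.
    static boolean shouldClassify(double[] pdfPerCluster, double threshold) {
        double max = Double.NEGATIVE_INFINITY;
        for (double pdf : pdfPerCluster) {
            max = Math.max(max, pdf);
        }
        return max >= threshold;
    }

    public static void main(String[] args) {
        // With threshold 0.0, even a vector whose pdf is 0 for every cluster is kept.
        System.out.println(shouldClassify(new double[] {0.0, 0.0}, 0.0));
        // With a positive threshold, a weakly matching vector is dropped as an outlier.
        System.out.println(shouldClassify(new double[] {0.2, 0.1}, 0.5));
    }
}
```

So, under those assumptions, a threshold of 0 cannot be what removes the vectors; the loss would have to happen somewhere else.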

If you still face the problem, then please provide the parameters with 
which you are calling kmeans.

Regarding "I should also mention I have vectors which are exactly the 
same (even their names), perhaps they are the ones being pruned, is that 
possible? "

The name of the vector has nothing to do with clustering, but I am not sure 
whether it will have any effect when clusterdumper is in action. So, 
crosschecking with ClusterOutputPostProcessorDriver will answer this.

Good luck.
Paritosh

On 17-08-2012 21:07, Whitmore, Mattie wrote:
> Sure, I have a dataset which I wish to cluster using Kmeans.  Previously (v0.5) when I did a clusterdump the total amount of vectors within the resultant clusters was the same as the total amount fed to the algorithm.  I wish this to be the case when clustering with v0.7.  The only change in the algorithm is clusterClassificationThreshold,  I set this value to be 0 so that it will in fact cluster all vectors in the dataset.
>
> My logic here was no vector should have a probability of being in some cluster less than 0 and therefore all vectors should cluster.
>
> However after running a clusterdump I find that vectors (1/3 roughly) have been pruned.
>
> Is this a bug, or me just not understanding the new capabilities?
>
> I should also mention I have vectors which are exactly the same (even their names), perhaps they are the ones being pruned, is that possible?
>
> Another question if I may: I will eventually want to use the pruning capabilities, does the ClusterOutputPostProcessorDriver method (or a similar method) have the capability of outputting the pruned vectors into a folder?
>
> Thanks! Please let me know if I'm still not being clear enough.
>
> Mattie

RE: Mahout-279/kmeans++

Posted by "Whitmore, Mattie" <mw...@harris.com>.

Sure. I have a dataset that I want to cluster using kmeans.  Previously (v0.5), when I ran a clusterdump, the total number of vectors in the resulting clusters matched the total number fed to the algorithm.  I want the same to hold when clustering with v0.7.  The only change in the algorithm is clusterClassificationThreshold; I set this value to 0 so that it would in fact cluster all vectors in the dataset.

My logic here was that no vector should have a probability less than 0 of being in some cluster, and therefore all vectors should be clustered.

However, after running a clusterdump I find that roughly 1/3 of the vectors have been pruned.

Is this a bug, or am I just not understanding the new capabilities?

I should also mention that I have vectors which are exactly the same (even their names).  Perhaps they are the ones being pruned; is that possible?
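
One way to test that hypothesis is to count how many input records are exact repeats and compare that count with the number of vectors missing from the clusterdump. Below is a minimal sketch in plain Java, assuming a sample of the named vectors has already been read into memory. NamedPoint and DuplicateCount are hypothetical stand-ins for illustration, not Mahout types.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DuplicateCount {
    /** Hypothetical stand-in for a named vector: a name plus its values. */
    static final class NamedPoint {
        final String name;
        final double[] values;

        NamedPoint(String name, double[] values) {
            this.name = name;
            this.values = values;
        }
    }

    /** Counts records that exactly repeat an earlier record (same name and same values). */
    static int countRepeats(List<NamedPoint> points) {
        Set<String> seen = new HashSet<>();
        int repeats = 0;
        for (NamedPoint p : points) {
            // Name and values together form the identity of a record.
            String key = p.name + "|" + Arrays.toString(p.values);
            if (!seen.add(key)) {
                repeats++; // add() returns false when the key was already present
            }
        }
        return repeats;
    }

    public static void main(String[] args) {
        List<NamedPoint> sample = Arrays.asList(
            new NamedPoint("a", new double[] {1.0, 2.0}),
            new NamedPoint("a", new double[] {1.0, 2.0}),  // exact repeat
            new NamedPoint("b", new double[] {1.0, 2.0})); // same values, different name
        System.out.println(countRepeats(sample)); // prints 1
    }
}
```

If the repeat count is close to the number of missing vectors, duplicates are a plausible explanation; if it is far off, the cause is probably elsewhere.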

Another question, if I may: I will eventually want to use the pruning capabilities.  Does ClusterOutputPostProcessorDriver (or something similar) have the capability of writing the pruned vectors out to a folder?

Thanks! Please let me know if I'm still not being clear enough.

Mattie

-----Original Message-----
From: Paritosh Ranjan [mailto:pranjan@xebia.com] 
Sent: Friday, August 17, 2012 11:20 AM
To: user@mahout.apache.org
Subject: Re: Mahout-279/kmeans++

clusterClassificationThreshold is for outlier removal, and this is the way it should be used.

Can you provide some more information about your job and the way you are calling it?

And if I look at the code, the vector should be clustered even if the pdf is 0. The method which decides whether the vector should be assigned to a particular cluster or not -

  /**
   * Decides whether the vector should be classified or not based on the max pdf
   * value of the clusters and threshold value.
   *
   * @return whether the vector should be classified or not.
   */
  private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
    return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
  }


Re: Mahout-279/kmeans++

Posted by Paritosh Ranjan <pr...@xebia.com>.

clusterClassificationThreshold is for outlier removal, and this is the way it should be used.

Can you provide some more information about your job and the way you are calling it?

Looking at the code, the vector should be clustered even if the pdf is 0. This is the method that decides whether a vector should be assigned to a cluster:

  /**
   * Decides whether the vector should be classified or not based on the max pdf
   * value of the clusters and threshold value.
   *
   * @return whether the vector should be classified or not.
   */
  private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
    return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
  }
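
The comparison above can be tried in isolation. The following is a minimal sketch in plain Java, not Mahout code. It assumes the per-cluster pdf values are held in a double[] as a stand-in for Mahout's Vector, and the class name ThresholdSketch is hypothetical. It only illustrates that, for this comparison, a threshold of 0 accepts any vector whose largest pdf is non-negative.

```java
final class ThresholdSketch {
    /** Same comparison as the quoted method: classify when the largest pdf meets the threshold. */
    static boolean shouldClassify(double[] pdfPerCluster, double threshold) {
        double max = Double.NEGATIVE_INFINITY;
        for (double p : pdfPerCluster) {
            max = Math.max(max, p);
        }
        return max >= threshold;
    }

    public static void main(String[] args) {
        // A vector with zero pdf for every cluster still passes at threshold 0.
        System.out.println(shouldClassify(new double[] {0.0, 0.0, 0.0}, 0.0)); // prints true
        // A higher threshold rejects a vector whose best pdf falls below it.
        System.out.println(shouldClassify(new double[] {0.2, 0.7, 0.1}, 0.8)); // prints false
    }
}
```

If this comparison is the only filter applied, a threshold of 0 would not drop vectors by itself, so the missing vectors would have to be lost at some other stage.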

On 17-08-2012 20:06, Whitmore, Mattie wrote:

> Hi Ted,
>
> Yes this is great!  I hope to start working with this algorithm in the next couple weeks.
>
> I have a question about the 0.7 implementation of kmeans and the clusterClassificationThreshold,  I have this value set at zero, but the output is still showing that about 1/3 of my data is not assigned to a cluster in my output.  Am I using this value incorrectly?  I did a kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned despite the clusterClassificationThreshold = 0.
>
>
> Thanks,
>
> Mattie

RE: Mahout-279/kmeans++

Posted by "Whitmore, Mattie" <mw...@harris.com>.

Hi Ted,

Yes, this is great!  I hope to start working with this algorithm in the next couple of weeks.

I have a question about the 0.7 implementation of kmeans and clusterClassificationThreshold.  I have this value set to zero, but the output still shows that about 1/3 of my data is not assigned to a cluster.  Am I using this value incorrectly?  I ran KMeansDriver.run with the 0.5 and 0.7 API, and the data was pruned despite clusterClassificationThreshold = 0.

Thanks,

Mattie

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Wednesday, August 15, 2012 5:20 PM
To: user@mahout.apache.org
Subject: Re: Mahout-279/kmeans++

Mattie,

Would this help?

https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java

and

https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf


Re: Mahout-279/kmeans++

Posted by Ted Dunning <te...@gmail.com>.

Mattie,

Would this help?

https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java

and

https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf

On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <mw...@harris.com> wrote:

> Hi!
>
> I have been using RandomSeedGenerator, and was hoping it had a patch like
> that described in Mahout-279 since I want only 10 vectors out of a set of
> more than 100,000,000.  I have been using canopy clustering for better
> results, but still need to do a few passes of kmeans to determine my T, and
> the random seed does take a long time.
>
> The comments say that you are working on a kmeans++, I searched around but
> couldn't confirm any more information about it.  Is a scalable kmeans++ in
> the works? (I know research on the subject is quite new)
>
> Thanks!
>
>
>
> Mattie Whitmore
> Mathematician/IR&D Software Engineer
> HARRIS  Corporation - Advanced Information Solutions
> 301.837.5278
> mwhitmor@harris.com