Posted to dev@mahout.apache.org by Lee S <sl...@gmail.com> on 2015/01/06 05:29:10 UTC

kmeans result is different from scikit-learn result with center points provided

Hi, I used synthetic data to test the kmeans method.
And I wrote my own code to convert the center points to SequenceFiles.
Then I ran kmeans with the parameters (-i input -o output -c center -x 3 -cd
1 -cl).
I compared the dumped clusteredPoints with the scikit-learn kmeans result,
and it's totally different. I'm very confused.

Has anybody ever run kmeans with the center points provided and compared the
result with another ML library?
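
For reference, a minimal sketch of the scikit-learn side of such a
comparison, assuming the data and starting centers are available as plain
text arrays (the file names and shapes here are illustrative, not from the
original run):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical inputs: X holds the points, centers the provided start centers.
X = np.loadtxt("synthetic_control.data")    # 600 rows x 60 columns
centers = np.loadtxt("centers.txt")         # k rows x 60 columns

# An explicit init plus n_init=1 makes the run deterministic, mirroring
# "kmeans with center points provided"; max_iter=3 mirrors the -x 3 flag.
km = KMeans(n_clusters=len(centers), init=centers, n_init=1, max_iter=3)
labels = km.fit_predict(X)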

Re: kmeans result is different from scikit-learn result with center points provided

Posted by Mathieu Blondel <ma...@mblondel.org>.
K-means requires solving a non-convex optimization problem.
This means that the solution found by any algorithm depends heavily on the
initialization.
You can't expect to get the same results if you don't use the same
initialization.
Even with the same initialization, slight differences in the implementation
could lead to different results.

HTH,
Mathieu

Re: kmeans result is different from scikit-learn result with center points provided

Posted by Ted Dunning <te...@gmail.com>.
You can run this gist with the following two lines of R, btw:

library(devtools)
source_url("https://gist.githubusercontent.com/tdunning/e1575ad2043af732c219/raw/444514454a6f3b5fcbbcaa3f8a919b1965e07f16/Clustering%20is%20hard")

You should see something like this as output:

SHA-1 hash of file is 2bc9bf7677d6d5b8b7aa1b1d49749574f5bd942e
$fail
[1] 96

$success
[1] 4

counts
 1  2  3  4
 4 71 22  3


Re: kmeans result is different from scikit-learn result with center points provided

Posted by Ted Dunning <te...@gmail.com>.
Clustering is harder than you appear to think:

http://www.imsc.res.in/~meena/papers/kmeans.pdf

https://en.wikipedia.org/wiki/K-means_clustering

NP-hard problems are typically solved by approximation.  K-means is a great
example.  Only a few, relatively unrealistic, examples have solutions
apparent enough to be found reliably by diverse algorithms.  For instance,
something as easy as Gaussian clusters with sd=1e-3 situated on 10 random
corners of a unit hypercube in 10 dimensional space will be clustered
differently by many algorithms unless multiple starts are used.

For instance, see https://gist.github.com/tdunning/e1575ad2043af732c219 for
an R script that demonstrates that R's standard k-means algorithms fail
over 95% of the time for this trivial input, occasionally splitting a
single cluster into three parts.  Restarting multiple times doesn't fix the
problem ... it only makes it a bit more tolerable.  This example shows how
even 90 restarts could fail for this particular problem.
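
A rough Python re-creation of that experiment, as a sketch (the gist itself
is in R; here scikit-learn with init="random" plays the role of R's random
starts, so the exact failure counts will differ):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
# 10 distinct random corners of the unit hypercube in 10 dimensions.
ids = rng.choice(2 ** 10, size=10, replace=False)
corners = np.array([[(i >> b) & 1 for b in range(10)] for i in ids], dtype=float)
# Tight Gaussian cluster (sd = 1e-3) of 100 points around each corner.
X = np.vstack([c + 1e-3 * rng.randn(100, 10) for c in corners])

fails = 0
for seed in range(100):
    km = KMeans(n_clusters=10, init="random", n_init=1, random_state=seed).fit(X)
    # A correct run puts exactly 100 points in each of the 10 clusters.
    if sorted(np.bincount(km.labels_).tolist()) != [100] * 10:
        fails += 1
print(fails, "of 100 single-start runs missed the obvious clustering")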

Re: kmeans result is different from scikit-learn result with center points provided

Posted by Lee S <sl...@gmail.com>.
But the parameters and distance measure are the same. The only difference:
Mahout kmeans convergence is based on whether every cluster has converged,
while scikit-learn uses the within-cluster sum-of-squares criterion.
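
If one wanted to take the differing stopping rules out of the comparison,
a sketch of the scikit-learn side (file names and k are illustrative;
centers stands in for the provided start centers):

import numpy as np
from sklearn.cluster import KMeans

X = np.loadtxt("synthetic_control.data")   # hypothetical load of the UCI data
centers = np.loadtxt("centers.txt")        # hypothetical provided start centers

# tol=0 disables scikit-learn's tolerance-based early stop, so the run ends
# only at max_iter; with identical explicit centers on both sides, the
# stopping-rule difference described above should mostly wash out.
km = KMeans(n_clusters=len(centers), init=centers, n_init=1,
            max_iter=300, tol=0).fit(X)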

Re: kmeans result is different from scikit-learn result with center points provided

Posted by Ted Dunning <te...@gmail.com>.
I don't think that data is sufficiently clusterable to expect a unique
solution.

Mean squared error would be a better measure of quality.
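
In scikit-learn terms, that comparison might look like this sketch
(inertia_ is the within-cluster sum of squared distances, so dividing by
the number of points gives a mean squared error; the file name is assumed):

import numpy as np
from sklearn.cluster import KMeans

X = np.loadtxt("synthetic_control.data")   # hypothetical local copy of the UCI data

# Compare runs by quality (mean squared error) rather than by matching labels.
for seed in (0, 1):
    km = KMeans(n_clusters=6, n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_ / len(X))      # mean squared distance to assigned centroid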

Re: kmeans result is different from scikit-learn result with center points provided

Posted by Lee S <sl...@gmail.com>.
Data is in this link:
http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
I converted it to a SequenceFile with InputDriver.
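
(For the scikit-learn side, loading that file is a one-liner; a sketch,
assuming a local download:)

import numpy as np

# synthetic_control.data has 600 rows of 60 whitespace-separated values
# (6 pattern classes x 100 series each); np.loadtxt parses it directly.
X = np.loadtxt("synthetic_control.data")
print(X.shape)   # (600, 60)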

Re: kmeans result is different from scikit-learn result with center points provided

Posted by Ted Dunning <te...@gmail.com>.
What kind of synthetic data did you use?
