Posted to user@mahout.apache.org by Dan Filimon <da...@gmail.com> on 2013/03/28 00:08:29 UTC

Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]

Ted, remember we talked about this last week?

The problem was (I think it's fixed now) that when I was asking for 20
clusters, every mapper would give me 20 clusters (rather than k log n
~ 200) and the points clumped together, resulting in one cluster with
the vast majority of the points (~17K out of the ~19K).
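
For reference, the k log n figure can be reproduced with a quick
calculation. This is only a sketch: it assumes the natural logarithm and
takes n to be about 19,000 (the ~19K points mentioned above). The function
name is made up for illustration and is not part of Mahout.

```python
import math

def sketch_size(k, n):
    """Rough number of intermediate clusters, k * ln(n) (natural log assumed)."""
    return k * math.log(n)

# k = 20 requested clusters, n ~ 19K documents (both taken from the message).
print(round(sketch_size(20, 19000)))
```

With these inputs the result comes out a little under 200, in line with
the estimate above.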

Now that I've fixed that, added more tests that seem to confirm that all
StreamingKMeans implementations get about the same results (whether
they're local or MapReduce), and with the multiple restarts of BallKMeans,
I'm expecting it to be a lot better.

Actual data tests coming soon (please check that new cluster thread). :)

Re: Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]

Posted by Dan Filimon <da...@gmail.com>.
I'd like to implement the test described in this paper [1] and also
explained in this presentation [2].
I went over the paper and I think I understand it well enough.

The main gist is that when dealing with high-dimensional data that has
lots of uncorrelated features (which should totally not be the case for
us!), distances become meaningless because the minimum and maximum
distances end up within a small constant factor of each other.
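
To make the effect concrete, here is a rough sketch (not the test from the
paper) that draws points with independent uniform features and measures
how far apart the nearest and farthest points are from a random query,
relative to the nearest one. The function name, the uniform distribution
and the sample sizes are all assumptions chosen for illustration.

```python
import math
import random

def relative_contrast(num_points, dim, seed=0):
    """(max distance - min distance) / min distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return (max(dists) - min(dists)) / min(dists)

# Independent (uncorrelated) features: the contrast shrinks as dim grows.
for dim in (2, 20, 200, 2000):
    print(dim, relative_contrast(200, dim))
```

If the contrast stays large on the real vectors, distances are probably
still informative; if it collapses toward zero, nearest and farthest
neighbours are hard to tell apart.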

It's not really about this particular data set, but since I find figuring
out whether distances are relevant or not challenging, I feel that any help
is welcome.

What do you think, Ted?

[1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
[2] http://www.cs.bham.ac.uk/~axk/Dagstuhl.pdf


On Thu, Mar 28, 2013 at 10:29 PM, Dan Filimon <da...@gmail.com> wrote:

> And I'll add that re-vectorizing the documents with my vectorizer yields
> essentially the same results (this is CosineDistance though):
>
> Average distance in cluster 0 [6]: 0.844053
> Average distance in cluster 1 [1047]: 0.988517
> Average distance in cluster 2 [26]: 0.889580
> Average distance in cluster 3 [19]: 0.922804
> Average distance in cluster 4 [2]: 0.414935
> Average distance in cluster 5 [9]: 0.777650
> Average distance in cluster 6 [4]: 0.791443
> Average distance in cluster 7 [17432]: 1.017289
> Average distance in cluster 8 [20]: 0.917523
> Average distance in cluster 9 [4]: 0.744159
> Average distance in cluster 10 [2]: 0.340740
> Average distance in cluster 11 [3]: 0.614734
> Average distance in cluster 12 [2]: 0.624274
> Average distance in cluster 13 [62]: 0.922437
> Average distance in cluster 14 [2]: 0.324862
> Average distance in cluster 15 [1]: 0.000000
> Average distance in cluster 16 [94]: 0.917509
> Average distance in cluster 17 [103]: 0.944392
> Average distance in cluster 18 [7]: 0.795449
> Average distance in cluster 19 [1]: 0.000000
> Num clusters: 20; maxDistance: 1.029701

Re: Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]

Posted by Ted Dunning <te...@gmail.com>.
If you used IDF weighting, then I think that cosine weighting is actually
the dot product, which is the cosine for unit vectors but wacky for
variable-length records.

Even so, I would have expected smaller weights.
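
A small sketch of the difference between the two quantities, using made-up
vectors (nothing here is taken from Mahout's code; the helper names are
hypothetical). For unit vectors the dot product and the cosine agree, but
on unnormalized vectors the dot product grows with document length.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(a):
    """Scale a to unit length."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]

# Two hypothetical weighted term vectors of very different lengths.
short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 5.0]

print(cosine(short_doc, long_doc))                      # bounded by 1
print(dot(normalize(short_doc), normalize(long_doc)))   # same value
print(dot(short_doc, long_doc))                         # unbounded
```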

On Thu, Mar 28, 2013 at 10:21 PM, Dan Filimon <da...@gmail.com> wrote:

> You know, regarding the latest clustering with CosineDistance.
> How is the _mean_ distance larger (or even close to) 1 if cos is in [-1,
> 1]? ...

Re: Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]

Posted by Dan Filimon <da...@gmail.com>.
You know, regarding the latest clustering with CosineDistance.
How is the _mean_ distance larger (or even close to) 1 if cos is in [-1,
1]? ...
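
One thing worth checking, assuming the measure reports 1 - cos rather than
cos itself (an assumption about the definition, not something confirmed in
this thread): that quantity ranges over [0, 2], and it only exceeds 1 when
the cosine is negative, which needs at least one negative component in the
vectors or centroids. A minimal sketch with made-up vectors:

```python
import math

def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumed definition, see note above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))    # same direction
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))    # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))   # opposite direction
```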


On Thu, Mar 28, 2013 at 10:29 PM, Dan Filimon <da...@gmail.com> wrote:

> And I'll add that re-vectorizing the documents with my vectorizer yields
> essentially the same results (this is CosineDistance though):
>
> Average distance in cluster 0 [6]: 0.844053
> Average distance in cluster 1 [1047]: 0.988517
> Average distance in cluster 2 [26]: 0.889580
> Average distance in cluster 3 [19]: 0.922804
> Average distance in cluster 4 [2]: 0.414935
> Average distance in cluster 5 [9]: 0.777650
> Average distance in cluster 6 [4]: 0.791443
> Average distance in cluster 7 [17432]: 1.017289
> Average distance in cluster 8 [20]: 0.917523
> Average distance in cluster 9 [4]: 0.744159
> Average distance in cluster 10 [2]: 0.340740
> Average distance in cluster 11 [3]: 0.614734
> Average distance in cluster 12 [2]: 0.624274
> Average distance in cluster 13 [62]: 0.922437
> Average distance in cluster 14 [2]: 0.324862
> Average distance in cluster 15 [1]: 0.000000
> Average distance in cluster 16 [94]: 0.917509
> Average distance in cluster 17 [103]: 0.944392
> Average distance in cluster 18 [7]: 0.795449
> Average distance in cluster 19 [1]: 0.000000
> Num clusters: 20; maxDistance: 1.029701

Re: Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]

Posted by Dan Filimon <da...@gmail.com>.
And I'll add that re-vectorizing the documents with my vectorizer yields
essentially the same results (this is CosineDistance though):

Average distance in cluster 0 [6]: 0.844053
Average distance in cluster 1 [1047]: 0.988517
Average distance in cluster 2 [26]: 0.889580
Average distance in cluster 3 [19]: 0.922804
Average distance in cluster 4 [2]: 0.414935
Average distance in cluster 5 [9]: 0.777650
Average distance in cluster 6 [4]: 0.791443
Average distance in cluster 7 [17432]: 1.017289
Average distance in cluster 8 [20]: 0.917523
Average distance in cluster 9 [4]: 0.744159
Average distance in cluster 10 [2]: 0.340740
Average distance in cluster 11 [3]: 0.614734
Average distance in cluster 12 [2]: 0.624274
Average distance in cluster 13 [62]: 0.922437
Average distance in cluster 14 [2]: 0.324862
Average distance in cluster 15 [1]: 0.000000
Average distance in cluster 16 [94]: 0.917509
Average distance in cluster 17 [103]: 0.944392
Average distance in cluster 18 [7]: 0.795449
Average distance in cluster 19 [1]: 0.000000
Num clusters: 20; maxDistance: 1.029701
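
To quantify how lopsided a listing like this is, a short script can parse
the lines and report the share of points in the largest cluster. This is a
sketch that assumes the exact line format shown above; the function names
are made up, and the sample holds only three of the twenty lines.

```python
import re

# Matches lines such as "Average distance in cluster 7 [17432]: 1.017289".
LINE = re.compile(r"Average distance in cluster (\d+) \[(\d+)\]: ([\d.]+)")

def parse_clusters(text):
    """Return a list of (cluster_id, size, avg_distance) tuples."""
    return [(int(m.group(1)), int(m.group(2)), float(m.group(3)))
            for m in LINE.finditer(text)]

def largest_share(clusters):
    """Fraction of all points that fall in the biggest cluster."""
    total = sum(size for _, size, _ in clusters)
    return max(size for _, size, _ in clusters) / total

sample = """\
Average distance in cluster 6 [4]: 0.791443
Average distance in cluster 7 [17432]: 1.017289
Average distance in cluster 8 [20]: 0.917523
"""
clusters = parse_clusters(sample)
print(clusters)
print(largest_share(clusters))
```

Run over the full listing above, the sizes sum to 18,846 and cluster 7
holds 17,432 of them, roughly 92% of the points.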


On Thu, Mar 28, 2013 at 6:45 PM, Dan Filimon <da...@gmail.com> wrote:

> You know what's even more odd? When I used Mahout's KMeans, everything was
> assigned to one single cluster with mean distance 64.

Re: Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]

Posted by Dan Filimon <da...@gmail.com>.
You know what's even more odd? When I used Mahout's KMeans, everything was
assigned to one single cluster with mean distance 64.


On Thu, Mar 28, 2013 at 11:07 AM, Ted Dunning <te...@gmail.com> wrote:

> Hmm... looking at these outputs, it looks like the big cluster is really
> tight ... much tighter than cluster 3 or 4.  That is very odd.
>
> On Thu, Mar 28, 2013 at 10:01 AM, Dan Filimon <da...@gmail.com> wrote:
>
> > [Yes, it should be on the dev list. I got confused.]
> >
> > The thing is, it's happening when using just 1 mapper. The hypercube
> > tests indicate that the 3 versions of StreamingKMeans produce about
> > the same results.
> > I haven't tested them on the _unprojected_ vectors though.
> >
> > Average distance in cluster 0 [18773]: 68.237385
> > Average distance in cluster 1 [2]: 5.973227
> > Average distance in cluster 2 [1]: 0.000000
> > Average distance in cluster 3 [4]: 279.200390
> > Average distance in cluster 4 [5]: 394.101672
> > Average distance in cluster 5 [4]: 227.845612
> > Average distance in cluster 6 [1]: 0.000000
> > Average distance in cluster 7 [2]: 28.779806
> > Average distance in cluster 8 [1]: 0.000000
> > Average distance in cluster 9 [2]: 215.254876
> > Average distance in cluster 10 [3]: 128.501163
> > Average distance in cluster 11 [8]: 534.401649
> > Average distance in cluster 12 [1]: 0.000000
> > Average distance in cluster 13 [5]: 405.115140
> > Average distance in cluster 14 [1]: 0.000000
> > Average distance in cluster 15 [9]: 215.797289
> > Average distance in cluster 16 [1]: 0.000000
> > Average distance in cluster 17 [2]: 123.065677
> > Average distance in cluster 18 [1]: 0.000000
> > Average distance in cluster 19 [2]: 98.733778
> > Num clusters: 20; maxDistance: 762.326896
> >
> > On Thu, Mar 28, 2013 at 10:32 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> > > I will have to think on this a bit.
> > >
> > > It should be possible to dump the sketches coming from each mapper
> > > and look at them for compatibility.
> > >
> > > Are the mappers seeing only docs from a single news group?  That might
> > > produce some interesting and odd results.
> > >
> > > What happens with the sequential version when you specify as many
> > > threads as you have mappers in the MR version?
> > >
> > > Also, shouldn't this be on the dev list?
> > >
> > > On Thu, Mar 28, 2013 at 9:10 AM, Dan Filimon
> > > <dangeorge.filimon@gmail.com> wrote:
> > >
> > >> So no, apparently the problem's still there. With the most recent
> > >> code, I get:
> > >>
> > >> Average distance in cluster 0 [1]: 0.000000
> > >> Average distance in cluster 1 [18775]: 63.839819
> > >> Average distance in cluster 2 [11]: 448.706077
> > >> Average distance in cluster 3 [1]: 0.000000
> > >> Average distance in cluster 4 [8]: 213.629578
> > >> Average distance in cluster 5 [1]: 0.000000
> > >> Average distance in cluster 6 [10]: 369.592682
> > >> Average distance in cluster 7 [1]: 0.000000
> > >> Average distance in cluster 8 [2]: 31.061103
> > >> Average distance in cluster 9 [1]: 0.000000
> > >> Average distance in cluster 10 [2]: 309.934857
> > >> Average distance in cluster 11 [1]: 0.000000
> > >> Average distance in cluster 12 [1]: 0.000000
> > >> Average distance in cluster 13 [1]: 0.000000
> > >> Average distance in cluster 14 [1]: 0.000000
> > >> Average distance in cluster 15 [4]: 229.180504
> > >> Average distance in cluster 16 [1]: 0.000000
> > >> Average distance in cluster 17 [3]: 336.835246
> > >> Average distance in cluster 18 [2]: 76.485594
> > >> Average distance in cluster 19 [1]: 0.000000
> > >> Num clusters: 20; maxDistance: 724.060033
> > >>
> > >> I'll have to recheck. :/
> > >>
> > >> On Thu, Mar 28, 2013 at 2:25 AM, Ted Dunning <te...@gmail.com>
> > >> wrote:
> > >> > Hot damn!
> > >> >
> > >> > Well spotted.
> > >> >
> > >> > On Thu, Mar 28, 2013 at 12:08 AM, Dan Filimon
> > >> > <da...@gmail.com> wrote:
> > >> >
> > >> >> Ted, remember we talked about this last week?
> > >> >>
> > >> >> The problem was (I think it's fixed now) that when I was asking for
> > >> >> 20 clusters, every mapper would give me 20 clusters (rather than k
> > >> >> log n ~ 200) and the points clumped together, resulting in one
> > >> >> cluster with the vast majority of the points (~17K out of the ~19K).
> > >> >>
> > >> >> Now that I've fixed that, added more tests that seem to confirm that
> > >> >> all StreamingKMeans implementations get about the same results
> > >> >> (whether they're local or MapReduce), and with the multiple restarts
> > >> >> of BallKMeans, I'm expecting it to be a lot better.
> > >> >>
> > >> >> Actual data tests coming soon (please check that new cluster
> > >> >> thread). :)
> > >> >>
> > >>
> >
>

Re: Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]

Posted by Ted Dunning <te...@gmail.com>.
Hmm... looking at these outputs, it looks like the big cluster is really
tight ... much tighter than cluster 3 or 4.  That is very odd.

On Thu, Mar 28, 2013 at 10:01 AM, Dan Filimon
<da...@gmail.com> wrote:

> [Yes, it should be on the dev list. I got confused.]
>
> The thing is, it's happening when using just 1 mapper. The hypercube
> tests indicate that the 3 versions of StreamingKMeans produce about
> the same results.
> I haven't tested them on the _unprojected_ vectors though.
>
> Average distance in cluster 0 [18773]: 68.237385
> Average distance in cluster 1 [2]: 5.973227
> Average distance in cluster 2 [1]: 0.000000
> Average distance in cluster 3 [4]: 279.200390
> Average distance in cluster 4 [5]: 394.101672
> Average distance in cluster 5 [4]: 227.845612
> Average distance in cluster 6 [1]: 0.000000
> Average distance in cluster 7 [2]: 28.779806
> Average distance in cluster 8 [1]: 0.000000
> Average distance in cluster 9 [2]: 215.254876
> Average distance in cluster 10 [3]: 128.501163
> Average distance in cluster 11 [8]: 534.401649
> Average distance in cluster 12 [1]: 0.000000
> Average distance in cluster 13 [5]: 405.115140
> Average distance in cluster 14 [1]: 0.000000
> Average distance in cluster 15 [9]: 215.797289
> Average distance in cluster 16 [1]: 0.000000
> Average distance in cluster 17 [2]: 123.065677
> Average distance in cluster 18 [1]: 0.000000
> Average distance in cluster 19 [2]: 98.733778
> Num clusters: 20; maxDistance: 762.326896
>

Re: Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]

Posted by Dan Filimon <da...@gmail.com>.
[Yes, it should be on the dev list. I got confused.]

The thing is, it's happening when using just 1 mapper. The hypercube
tests indicate that the 3 versions of StreamingKMeans produce about
the same results.
I haven't tested them on the _unprojected_ vectors though.

Average distance in cluster 0 [18773]: 68.237385
Average distance in cluster 1 [2]: 5.973227
Average distance in cluster 2 [1]: 0.000000
Average distance in cluster 3 [4]: 279.200390
Average distance in cluster 4 [5]: 394.101672
Average distance in cluster 5 [4]: 227.845612
Average distance in cluster 6 [1]: 0.000000
Average distance in cluster 7 [2]: 28.779806
Average distance in cluster 8 [1]: 0.000000
Average distance in cluster 9 [2]: 215.254876
Average distance in cluster 10 [3]: 128.501163
Average distance in cluster 11 [8]: 534.401649
Average distance in cluster 12 [1]: 0.000000
Average distance in cluster 13 [5]: 405.115140
Average distance in cluster 14 [1]: 0.000000
Average distance in cluster 15 [9]: 215.797289
Average distance in cluster 16 [1]: 0.000000
Average distance in cluster 17 [2]: 123.065677
Average distance in cluster 18 [1]: 0.000000
Average distance in cluster 19 [2]: 98.733778
Num clusters: 20; maxDistance: 762.326896
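
To put a number on how lopsided this output is, here is a minimal Python sketch that parses the lines above and reports the share of points in the largest cluster and the count of single-point clusters. The regular expression and the function name `parse_clusters` are assumptions made for illustration; they are not part of Mahout.

```python
import re

# Output lines copied from the message above.
LOG = """\
Average distance in cluster 0 [18773]: 68.237385
Average distance in cluster 1 [2]: 5.973227
Average distance in cluster 2 [1]: 0.000000
Average distance in cluster 3 [4]: 279.200390
Average distance in cluster 4 [5]: 394.101672
Average distance in cluster 5 [4]: 227.845612
Average distance in cluster 6 [1]: 0.000000
Average distance in cluster 7 [2]: 28.779806
Average distance in cluster 8 [1]: 0.000000
Average distance in cluster 9 [2]: 215.254876
Average distance in cluster 10 [3]: 128.501163
Average distance in cluster 11 [8]: 534.401649
Average distance in cluster 12 [1]: 0.000000
Average distance in cluster 13 [5]: 405.115140
Average distance in cluster 14 [1]: 0.000000
Average distance in cluster 15 [9]: 215.797289
Average distance in cluster 16 [1]: 0.000000
Average distance in cluster 17 [2]: 123.065677
Average distance in cluster 18 [1]: 0.000000
Average distance in cluster 19 [2]: 98.733778
Num clusters: 20; maxDistance: 762.326896
"""

# Matches e.g. "Average distance in cluster 3 [4]: 279.200390".
LINE = re.compile(r"Average distance in cluster (\d+) \[(\d+)\]: ([\d.]+)")


def parse_clusters(text):
    """Return a list of (cluster_id, size, avg_distance) tuples."""
    return [(int(c), int(n), float(d)) for c, n, d in LINE.findall(text)]


clusters = parse_clusters(LOG)
total = sum(size for _, size, _ in clusters)
largest_id, largest_size, _ = max(clusters, key=lambda c: c[1])
singletons = sum(1 for _, size, _ in clusters if size == 1)
share = largest_size / total

print(f"{total} points; cluster {largest_id} holds {share:.1%}; "
      f"{singletons} single-point clusters")
```

The same parser works on the earlier run's output, since the line format is identical.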

On Thu, Mar 28, 2013 at 10:32 AM, Ted Dunning <te...@gmail.com> wrote:
> I will have to think on this a bit.
>
> It should be possible to dump the sketches coming from each mapper and look
> at them for compatibility.
>
> Are the mappers seeing only docs from a single news group?  That might
> produce some interesting and odd results.
>
> What happens with the sequential version when you specify as many threads
> as you have mappers in the MR version?
>
> Also, shouldn't this be on the dev list?
>

Re: Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]

Posted by Ted Dunning <te...@gmail.com>.
I will have to think on this a bit.

It should be possible to dump the sketches coming from each mapper and look
at them for compatibility.

Are the mappers seeing only docs from a single news group?  That might
produce some interesting and odd results.

What happens with the sequential version when you specify as many threads
as you have mappers in the MR version?

Also, shouldn't this be on the dev list?
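
One way to answer the single-news-group question is to count the distinct labels each input split contains. The sketch below is hypothetical: it assumes you can produce `(split_id, newsgroup)` pairs for the input documents, and how to get them depends on how the data was written. The names `labels_per_split` and `single_group_splits` are illustrative only.

```python
from collections import Counter, defaultdict


def labels_per_split(records):
    """Count newsgroup labels per split.

    records: iterable of (split_id, newsgroup) pairs (hypothetical input).
    """
    counts = defaultdict(Counter)
    for split_id, newsgroup in records:
        counts[split_id][newsgroup] += 1
    return counts


def single_group_splits(records):
    """Return the sorted ids of splits that contain exactly one newsgroup."""
    per_split = labels_per_split(records)
    return sorted(s for s, c in per_split.items() if len(c) == 1)


# Toy data: split 0 sees one group only, split 1 sees two.
records = [
    (0, "sci.space"), (0, "sci.space"),
    (1, "sci.space"), (1, "rec.autos"),
]
print(single_group_splits(records))
```

If every split shows up in the result, the input was probably written group by group and shuffling it before clustering would be worth trying.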

On Thu, Mar 28, 2013 at 9:10 AM, Dan Filimon <da...@gmail.com> wrote:

> So no, apparently the problem's still there. With the most recent code, I
> get:
>
> Average distance in cluster 0 [1]: 0.000000
> Average distance in cluster 1 [18775]: 63.839819
> Average distance in cluster 2 [11]: 448.706077
> Average distance in cluster 3 [1]: 0.000000
> Average distance in cluster 4 [8]: 213.629578
> Average distance in cluster 5 [1]: 0.000000
> Average distance in cluster 6 [10]: 369.592682
> Average distance in cluster 7 [1]: 0.000000
> Average distance in cluster 8 [2]: 31.061103
> Average distance in cluster 9 [1]: 0.000000
> Average distance in cluster 10 [2]: 309.934857
> Average distance in cluster 11 [1]: 0.000000
> Average distance in cluster 12 [1]: 0.000000
> Average distance in cluster 13 [1]: 0.000000
> Average distance in cluster 14 [1]: 0.000000
> Average distance in cluster 15 [4]: 229.180504
> Average distance in cluster 16 [1]: 0.000000
> Average distance in cluster 17 [3]: 336.835246
> Average distance in cluster 18 [2]: 76.485594
> Average distance in cluster 19 [1]: 0.000000
> Num clusters: 20; maxDistance: 724.060033
>
> I'll have to recheck. :/
>

Re: Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]

Posted by Dan Filimon <da...@gmail.com>.
So no, apparently the problem's still there. With the most recent code, I get:

Average distance in cluster 0 [1]: 0.000000
Average distance in cluster 1 [18775]: 63.839819
Average distance in cluster 2 [11]: 448.706077
Average distance in cluster 3 [1]: 0.000000
Average distance in cluster 4 [8]: 213.629578
Average distance in cluster 5 [1]: 0.000000
Average distance in cluster 6 [10]: 369.592682
Average distance in cluster 7 [1]: 0.000000
Average distance in cluster 8 [2]: 31.061103
Average distance in cluster 9 [1]: 0.000000
Average distance in cluster 10 [2]: 309.934857
Average distance in cluster 11 [1]: 0.000000
Average distance in cluster 12 [1]: 0.000000
Average distance in cluster 13 [1]: 0.000000
Average distance in cluster 14 [1]: 0.000000
Average distance in cluster 15 [4]: 229.180504
Average distance in cluster 16 [1]: 0.000000
Average distance in cluster 17 [3]: 336.835246
Average distance in cluster 18 [2]: 76.485594
Average distance in cluster 19 [1]: 0.000000
Num clusters: 20; maxDistance: 724.060033

I'll have to recheck. :/
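
For reading the numbers above: the thread does not spell out how "Average distance" is computed. The sketch below assumes it is the mean Euclidean distance from each point to the mean of the points in its cluster. Under that assumption a single-point cluster always reports 0, which matches the 0.000000 entries next to every `[1]`. The function names are illustrative, not Mahout's.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_distance(points):
    """Mean distance from each point to the cluster centroid (assumed metric)."""
    c = centroid(points)
    return sum(euclidean(p, c) for p in points) / len(points)


singleton = [[3.0, 4.0]]            # the centroid is the point itself
pair = [[0.0, 0.0], [6.0, 8.0]]     # centroid (3, 4), each point 5 away

print(average_distance(singleton))
print(average_distance(pair))
```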

On Thu, Mar 28, 2013 at 2:25 AM, Ted Dunning <te...@gmail.com> wrote:
> Hot damn!
>
> Well spotted.
>

Re: Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]

Posted by Ted Dunning <te...@gmail.com>.
Hot damn!

Well spotted.

On Thu, Mar 28, 2013 at 12:08 AM, Dan Filimon
<da...@gmail.com> wrote:

> Ted, remember we talked about this last week?
>
> The problem was (I think it's fixed now) that when I was asking for 20
> clusters, every mapper would give me 20 clusters (rather than k log n
> ~ 200) and the points clumped together, resulting in one cluster with
> the vast majority of the points (~17K out of the ~19K).
>
> Now that I've fixed that and added more tests that seem to confirm all
> StreamingKMeans implementations get about the same results (whether
> they're local or MapReduce), plus the multiple restarts of BallKMeans,
> I'm expecting it to be a lot better.
>
> Actual data tests coming soon (please check that new cluster thread). :)
>
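
As a quick check of the "k log n ~ 200" figure quoted above, this sketch evaluates k log n for three common logarithm bases. The thread does not say which base is meant, and `n = 18828` is an assumption taken from summing the cluster sizes reported elsewhere in this thread. With the natural logarithm the product comes out at roughly 197, which fits the "~ 200" in the message.

```python
import math

k = 20      # requested number of clusters, from the message
n = 18828   # assumed: sum of the cluster sizes reported in this thread

# k * log(n) under three different bases; the thread does not name the base.
estimates = {
    "ln": k * math.log(n),
    "log2": k * math.log2(n),
    "log10": k * math.log10(n),
}

for name, value in estimates.items():
    print(f"k * {name}(n) = {value:.0f}")
```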