Posted to dev@mahout.apache.org by Suneel Marthi <su...@yahoo.com> on 2013/12/23 22:14:32 UTC

Streaming KMeans clustering

Has anyone been successful running Streaming KMeans clustering on a large dataset (> 100,000 points)?


It just seems to take a very long time (> 4 hrs) for the mappers to finish on about 300K data points, and the reduce phase has only a single reducer running, which throws an OOM and fails the job several hours after it has been kicked off.

It's the same story when trying to run in sequential mode.

Looking at the code, the bottleneck seems to be in StreamingKMeans.clusterInternal(); without understanding the behaviour of the algorithm, I am not sure if the sequence of steps in there is correct.


There are a few calls that are invoked repeatedly, over and over again, like StreamingKMeans.clusterInternal() and Searcher.searchFirst().

We really need to have this working on datasets that are larger than the 20K Reuters dataset.

I am trying to run this on 300K vectors with k = 100, km = 1261 and FastProjectSearch.
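
For context on what the algorithm does per point, here is a minimal sketch of the streaming pass as described in the streaming k-means literature. This is not Mahout's actual code; all class and variable names here (StreamingSketch, cutoff, maxClusters) are made up for illustration, and the re-clustering that clusterInternal() performs is only indicated by a comment.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class StreamingSketch {
  // squared Euclidean distance between two dense points
  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  final List<double[]> centroids = new ArrayList<double[]>();
  final List<Double> weights = new ArrayList<Double>();
  final Random rnd = new Random();
  double cutoff;          // plays the role of estimatedDistanceCutoff
  final int maxClusters;  // sketch size budget, roughly k * log(n)

  StreamingSketch(double initialCutoff, int maxClusters) {
    this.cutoff = initialCutoff;
    this.maxClusters = maxClusters;
  }

  void add(double[] point) {
    if (centroids.isEmpty()) {
      centroids.add(point);
      weights.add(1.0);
      return;
    }
    // searchFirst() analogue: find the closest existing sketch centroid
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centroids.size(); i++) {
      double d = distance(point, centroids.get(i));
      if (d < bestDist) {
        bestDist = d;
        best = i;
      }
    }
    // Either open a new cluster (probability grows with the distance to the
    // nearest centroid) or fold the point into that nearest centroid.
    if (rnd.nextDouble() < bestDist / cutoff) {
      centroids.add(point);
      weights.add(1.0);
    } else {
      double[] c = centroids.get(best);
      double w = weights.get(best);
      for (int i = 0; i < c.length; i++) {
        c[i] = (c[i] * w + point[i]) / (w + 1);
      }
      weights.set(best, w + 1);
    }
    // When the sketch outgrows its budget, the real implementation raises the
    // cutoff and re-clusters the sketch centroids through this same loop; that
    // recursion is where clusterInternal() spends its time.
    if (centroids.size() > maxClusters) {
      cutoff *= 1.5;
      // recluster(centroids) -- omitted in this sketch
    }
  }
}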

Re: Streaming KMeans clustering

Posted by Suneel Marthi <su...@yahoo.com>.
@Johannes, from reading your 2 emails I didn't quite get whether Streaming KMeans worked for you or not. What were the issues you had identified with pending additions and projection?






On Wednesday, December 25, 2013 5:40 AM, Johannes Schulte <jo...@gmail.com> wrote:
 
Hey Sebastian,

it was a text-like clustering problem with a dimensionality of 100,000; the
number of data points could have been a million, but I always cancelled
it after a while (I used the Java classes, not the command line version,
and monitored the progress).

As for my statements above: they are possibly not quite correct. Sure, the
projection search reduces the amount of searching needed, but when I
looked into the code I identified two problems, if I remember correctly:

- the searching of pending additions
- the projection itself


But I'll have to retry that and look into the code again. I ended up using
the old k-means code on a sample of the data.

cheers,

johannes



On Wed, Dec 25, 2013 at 11:17 AM, Sebastian Schelter <ss...@apache.org> wrote:

> Hi Johannes,
>
> can you share some details about the dataset that you ran streaming
> k-means on (number of datapoints, cardinality, etc)?
>
> @Ted/Suneel Shouldn't the approximate searching techniques (e.g.
> projection search) help cope with high dimensional inputs?
>
> --sebastian
>
>
> On 25.12.2013 10:42, Johannes Schulte wrote:
> > Hi,
> >
> > i also had problems getting up to speed but i made the cardinality of the
> > vectors responsible for that. i didn't do the math exactly but while
> > streaming k-means improves over regular k-means in using log(k) and
> > (n_umber of datapoints / k) passes, the d_imension parameter from the
> > original k*d*n stays untouched, right?
> >
> > What is your vector's cardinality?
> >
> >
> > On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi <suneel_marthi@yahoo.com
> >wrote:
> >
> >> Ted,
> >>
> >> What were the CLI parameters when you ran this test for 1M points - no.
> of
> >> clusters k, km, distanceMeasure, projectionSearch,
> estimatedDistanceCutoff?
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <
> ted.dunning@gmail.com>
> >> wrote:
> >>
> >> For reference, on a 16 core machine, I was able to run the sequential
> >> version of streaming k-means on 1,000,000 points, each with 10
> dimensions
> >> in about 20 seconds.  The map-reduce versions are comparable subject to
> >> scaling except for startup time.
> >>
> >>
> >>
> >> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <ss...@apache.org>
> >> wrote:
> >>
> >>> That the algorithm runs a single reducer is expected. The algorithm
> >>> creates a sketch of the data in parallel in the map-phase, which is
> >>> collected by the reducer afterwards. The reducer then applies an
> >>> expensive in-memory clustering algorithm to the sketch.
> >>>
> >>> Which dataset are you using for testing? I can also do some tests on a
> >>> cluster here.
> >>>
> >>> I can imagine two possible causes for the problems: Maybe there's a
> >>> problem with the vectors and some calculations take very long because
> >>> the wrong access pattern or implementation is chosen.
> >>>
> >>> Another problem could be that the mappers and reducers have too few
> >>> memory and spend a lot of time running garbage collections.
> >>>
> >>> --sebastian
> >>>
> >>>
> >>> On 23.12.2013 22:14, Suneel Marthi wrote:
> >>>> Has anyone be successful running Streaming KMeans clustering on a
> large
> >>> dataset (> 100,000 points)?
> >>>>
> >>>>
> >>>> It just seems to take a very long time (> 4hrs) for the mappers to
> >>> finish on about 300K data points and the reduce phase has only a single
> >>> reducer running and throws an OOM failing the job several hours after
> the
> >>> job has been kicked off.
> >>>>
> >>>> Its the same story when trying to run in sequential mode.
> >>>>
> >>>> Looking at the code the bottleneck seems to be in
> >>> StreamingKMeans.clusterInternal(), without understanding the behaviour
> of
> >>> the algorithm I am not sure if the sequence of steps in there is
> correct.
> >>>>
> >>>>
> >>>> There are few calls that call themselves repeatedly over and over
> again
> >>> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
> >>>>
> >>>> We really need to have this working on datasets that are larger than
> >> 20K
> >>> reuters datasets.
> >>>>
> >>>> I am trying to run this on 300K vectors with k= 100, km = 1261 and
> >>> FastProjectSearch.
> >>>>
> >>>
> >>>
> >>
> >
>
>

Re: Streaming KMeans clustering

Posted by Johannes Schulte <jo...@gmail.com>.
Hey Sebastian,

it was a text-like clustering problem with a dimensionality of 100,000; the
number of data points could have been a million, but I always cancelled
it after a while (I used the Java classes, not the command line version,
and monitored the progress).

As for my statements above: they are possibly not quite correct. Sure, the
projection search reduces the amount of searching needed, but when I
looked into the code I identified two problems, if I remember correctly:

- the searching of pending additions
- the projection itself


But I'll have to retry that and look into the code again. I ended up using
the old k-means code on a sample of the data.

cheers,

johannes


On Wed, Dec 25, 2013 at 11:17 AM, Sebastian Schelter <ss...@apache.org> wrote:

> Hi Johannes,
>
> can you share some details about the dataset that you ran streaming
> k-means on (number of datapoints, cardinality, etc)?
>
> @Ted/Suneel Shouldn't the approximate searching techniques (e.g.
> projection search) help cope with high dimensional inputs?
>
> --sebastian
>
>
> On 25.12.2013 10:42, Johannes Schulte wrote:
> > Hi,
> >
> > i also had problems getting up to speed but i made the cardinality of the
> > vectors responsible for that. i didn't do the math exactly but while
> > streaming k-means improves over regular k-means in using log(k) and
> > (n_umber of datapoints / k) passes, the d_imension parameter from the
> > original k*d*n stays untouched, right?
> >
> > What is your vector's cardinality?
> >
> >
> > On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi <suneel_marthi@yahoo.com
> >wrote:
> >
> >> Ted,
> >>
> >> What were the CLI parameters when you ran this test for 1M points - no.
> of
> >> clusters k, km, distanceMeasure, projectionSearch,
> estimatedDistanceCutoff?
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <
> ted.dunning@gmail.com>
> >> wrote:
> >>
> >> For reference, on a 16 core machine, I was able to run the sequential
> >> version of streaming k-means on 1,000,000 points, each with 10
> dimensions
> >> in about 20 seconds.  The map-reduce versions are comparable subject to
> >> scaling except for startup time.
> >>
> >>
> >>
> >> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <ss...@apache.org>
> >> wrote:
> >>
> >>> That the algorithm runs a single reducer is expected. The algorithm
> >>> creates a sketch of the data in parallel in the map-phase, which is
> >>> collected by the reducer afterwards. The reducer then applies an
> >>> expensive in-memory clustering algorithm to the sketch.
> >>>
> >>> Which dataset are you using for testing? I can also do some tests on a
> >>> cluster here.
> >>>
> >>> I can imagine two possible causes for the problems: Maybe there's a
> >>> problem with the vectors and some calculations take very long because
> >>> the wrong access pattern or implementation is chosen.
> >>>
> >>> Another problem could be that the mappers and reducers have too few
> >>> memory and spend a lot of time running garbage collections.
> >>>
> >>> --sebastian
> >>>
> >>>
> >>> On 23.12.2013 22:14, Suneel Marthi wrote:
> >>>> Has anyone be successful running Streaming KMeans clustering on a
> large
> >>> dataset (> 100,000 points)?
> >>>>
> >>>>
> >>>> It just seems to take a very long time (> 4hrs) for the mappers to
> >>> finish on about 300K data points and the reduce phase has only a single
> >>> reducer running and throws an OOM failing the job several hours after
> the
> >>> job has been kicked off.
> >>>>
> >>>> Its the same story when trying to run in sequential mode.
> >>>>
> >>>> Looking at the code the bottleneck seems to be in
> >>> StreamingKMeans.clusterInternal(), without understanding the behaviour
> of
> >>> the algorithm I am not sure if the sequence of steps in there is
> correct.
> >>>>
> >>>>
> >>>> There are few calls that call themselves repeatedly over and over
> again
> >>> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
> >>>>
> >>>> We really need to have this working on datasets that are larger than
> >> 20K
> >>> reuters datasets.
> >>>>
> >>>> I am trying to run this on 300K vectors with k= 100, km = 1261 and
> >>> FastProjectSearch.
> >>>>
> >>>
> >>>
> >>
> >
>
>

Re: Streaming KMeans clustering

Posted by Suneel Marthi <su...@yahoo.com>.





On Wednesday, December 25, 2013 5:20 AM, Sebastian Schelter <ss...@apache.org> wrote:
 
Hi Johannes,

can you share some details about the dataset that you ran streaming
k-means on (number of datapoints, cardinality, etc)?

@Ted/Suneel Shouldn't the approximate searching techniques (e.g.
projection search) help cope with high dimensional inputs?

>>> @ssc, that's my understanding too from reading the reference paper for this impl.

--sebastian



On 25.12.2013 10:42, Johannes Schulte wrote:
> Hi,
> 
> i also had problems getting up to speed but i made the cardinality of the
> vectors responsible for that. i didn't do the math exactly but while
> streaming k-means improves over regular k-means in using log(k) and
> (n_umber of datapoints / k) passes, the d_imension parameter from the
> original k*d*n stays untouched, right?
> 
> What is your vector's cardinality?
> 
> 
> On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi <su...@yahoo.com>wrote:
> 
>> Ted,
>>
>> What were the CLI parameters when you ran this test for 1M points - no. of
>> clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff?
>>
>>
>>
>>
>>
>>
>>
>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>> For reference, on a 16 core machine, I was able to run the sequential
>> version of streaming k-means on 1,000,000 points, each with 10 dimensions
>> in about 20 seconds.  The map-reduce versions are comparable subject to
>> scaling except for startup time.
>>
>>
>>
>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <ss...@apache.org>
>> wrote:
>>
>>> That the algorithm runs a single reducer is expected. The algorithm
>>> creates a sketch of the data in parallel in the map-phase, which is
>>> collected by the reducer afterwards. The reducer then applies an
>>> expensive in-memory clustering algorithm to the sketch.
>>>
>>> Which dataset are you using for testing? I can also do some tests on a
>>> cluster here.
>>>
>>> I can imagine two possible causes for the problems: Maybe there's a
>>> problem with the vectors and some calculations take very long because
>>> the wrong access pattern or implementation is chosen.
>>>
>>> Another problem could be that the mappers and reducers have too few
>>> memory and spend a lot of time running garbage collections.
>>>
>>> --sebastian
>>>
>>>
>>> On 23.12.2013 22:14, Suneel Marthi wrote:
>>>> Has anyone be successful running Streaming KMeans clustering on a large
>>> dataset (> 100,000 points)?
>>>>
>>>>
>>>> It just seems to take a very long time (> 4hrs) for the mappers to
>>> finish on about 300K data points and the reduce phase has only a single
>>> reducer running and throws an OOM failing the job several hours after the
>>> job has been kicked off.
>>>>
>>>> Its the same story when trying to run in sequential mode.
>>>>
>>>> Looking at the code the bottleneck seems to be in
>>> StreamingKMeans.clusterInternal(), without understanding the behaviour of
>>> the algorithm I am not sure if the sequence of steps in there is correct.
>>>>
>>>>
>>>> There are few calls that call themselves repeatedly over and over again
>>> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
>>>>
>>>> We really need to have this working on datasets that are larger than
>> 20K
>>> reuters datasets.
>>>>
>>>> I am trying to run this on 300K vectors with k= 100, km = 1261 and
>>> FastProjectSearch.
>>>>
>>>
>>>
>>
> 

Re: Streaming KMeans clustering

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Johannes,

can you share some details about the dataset that you ran streaming
k-means on (number of datapoints, cardinality, etc)?

@Ted/Suneel Shouldn't the approximate searching techniques (e.g.
projection search) help cope with high dimensional inputs?

--sebastian


On 25.12.2013 10:42, Johannes Schulte wrote:
> Hi,
> 
> i also had problems getting up to speed but i made the cardinality of the
> vectors responsible for that. i didn't do the math exactly but while
> streaming k-means improves over regular k-means in using log(k) and
> (n_umber of datapoints / k) passes, the d_imension parameter from the
> original k*d*n stays untouched, right?
> 
> What is your vector's cardinality?
> 
> 
> On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi <su...@yahoo.com>wrote:
> 
>> Ted,
>>
>> What were the CLI parameters when you ran this test for 1M points - no. of
>> clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff?
>>
>>
>>
>>
>>
>>
>>
>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>> For reference, on a 16 core machine, I was able to run the sequential
>> version of streaming k-means on 1,000,000 points, each with 10 dimensions
>> in about 20 seconds.  The map-reduce versions are comparable subject to
>> scaling except for startup time.
>>
>>
>>
>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <ss...@apache.org>
>> wrote:
>>
>>> That the algorithm runs a single reducer is expected. The algorithm
>>> creates a sketch of the data in parallel in the map-phase, which is
>>> collected by the reducer afterwards. The reducer then applies an
>>> expensive in-memory clustering algorithm to the sketch.
>>>
>>> Which dataset are you using for testing? I can also do some tests on a
>>> cluster here.
>>>
>>> I can imagine two possible causes for the problems: Maybe there's a
>>> problem with the vectors and some calculations take very long because
>>> the wrong access pattern or implementation is chosen.
>>>
>>> Another problem could be that the mappers and reducers have too few
>>> memory and spend a lot of time running garbage collections.
>>>
>>> --sebastian
>>>
>>>
>>> On 23.12.2013 22:14, Suneel Marthi wrote:
>>>> Has anyone be successful running Streaming KMeans clustering on a large
>>> dataset (> 100,000 points)?
>>>>
>>>>
>>>> It just seems to take a very long time (> 4hrs) for the mappers to
>>> finish on about 300K data points and the reduce phase has only a single
>>> reducer running and throws an OOM failing the job several hours after the
>>> job has been kicked off.
>>>>
>>>> Its the same story when trying to run in sequential mode.
>>>>
>>>> Looking at the code the bottleneck seems to be in
>>> StreamingKMeans.clusterInternal(), without understanding the behaviour of
>>> the algorithm I am not sure if the sequence of steps in there is correct.
>>>>
>>>>
>>>> There are few calls that call themselves repeatedly over and over again
>>> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
>>>>
>>>> We really need to have this working on datasets that are larger than
>> 20K
>>> reuters datasets.
>>>>
>>>> I am trying to run this on 300K vectors with k= 100, km = 1261 and
>>> FastProjectSearch.
>>>>
>>>
>>>
>>
> 


Re: Streaming KMeans clustering

Posted by Suneel Marthi <su...@yahoo.com>.
@Johannes, how many data points did you have in your test?  Since Streaming KMeans runs through a single reducer, how much memory did you have to allocate if you had, say, a million data points?  What estimatedDistanceCutoff did you use?

@All, my experience so far has been that once the Mapper phase is done (for over a million data points), the Reducer mostly fails with OOM errors, depending on the number of clusters (or data points?) specified.

It could be argued that the initial choice of 'k' wasn't right to begin with; well yes, but how does that make this any different from the old-fashioned Canopy -> KMeans way of clustering (at least we had a better estimate of 'k' that way)? Not to mention that users have always had OOM issues with the single Reducer in Canopy (as has been reported by several users on user@ and dev@).
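
For a rough sense of why the single reducer runs out of heap, here is a back-of-envelope sketch. It mixes numbers from this thread (km = 1261 from my run, dimensionality ~100,000 from Johannes' case); the number of map tasks and the sparsity assumptions are made up for the illustration and are not measurements.

// Back-of-envelope estimate of the sketch the single reducer has to hold.
public class ReducerMemoryEstimate {
  public static void main(String[] args) {
    long centroidsPerMapper = 1261;      // the "km" value from the original post
    long mappers = 5;                    // assumed number of map tasks
    long dimensionality = 100_000;       // cardinality Johannes mentions
    long bytesPerElement = 8;            // double

    // If sketch centroids densify (e.g. after many point merges), each one is ~d doubles.
    long denseBytes = centroidsPerMapper * mappers * dimensionality * bytesPerElement;
    System.out.printf("dense sketch: ~%.1f GB%n", denseBytes / 1e9);   // ~5.0 GB

    // If they stay sparse at ~10,000 non-zeros, it is an order of magnitude less.
    long nonZeros = 10_000;
    long sparseBytes = centroidsPerMapper * mappers * nonZeros * (8 + 4); // value + index
    System.out.printf("sparse sketch: ~%.1f GB%n", sparseBytes / 1e9);   // ~0.8 GB
  }
}

Either way this is close to or well beyond a 2 GB reducer heap, which is consistent with the OOMs reported below.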







On Wednesday, December 25, 2013 4:42 AM, Johannes Schulte <jo...@gmail.com> wrote:
 
Hi,

I also had problems getting up to speed, but I made the cardinality of the vectors responsible for that. I didn't do the math exactly, but while streaming k-means improves over regular k-means by using log(k) and (number of data points / k) passes, the dimension parameter d from the original k*d*n stays untouched, right?

What is your vector's cardinality?



On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi <su...@yahoo.com> wrote:

Ted,
>
>What were the CLI parameters when you ran this test for 1M points - no. of clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff?
>
>
>
>
>
>
>
>
>On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <te...@gmail.com> wrote:
>
>For reference, on a 16 core machine, I was able to run the sequential
>version of streaming k-means on 1,000,000 points, each with 10 dimensions
>in about 20 seconds.  The map-reduce versions are comparable subject to
>scaling except for startup time.
>
>
>
>On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <ss...@apache.org> wrote:
>
>> That the algorithm runs a single reducer is expected. The algorithm
>> creates a sketch of the data in parallel in the map-phase, which is
>> collected by the reducer afterwards. The reducer then applies an
>> expensive in-memory clustering algorithm to the sketch.
>>
>> Which dataset are you using for testing? I can also do some tests on a
>> cluster here.
>>
>> I can imagine two possible causes for the problems: Maybe there's a
>> problem with the vectors and some calculations take very long because
>> the wrong access pattern or implementation is chosen.
>>
>> Another problem could be that the mappers and reducers have too few
>> memory and spend a lot of time running garbage collections.
>>
>> --sebastian
>>
>>
>> On 23.12.2013 22:14, Suneel Marthi wrote:
>> > Has anyone be successful running Streaming KMeans clustering on a large
>> dataset (> 100,000 points)?
>> >
>> >
>> > It just seems to take a very long time (> 4hrs) for the mappers to
>> finish on about 300K data points and the reduce phase has only a single
>> reducer running and throws an OOM failing the job several hours after the
>> job has been kicked off.
>> >
>> > Its the same story when trying to run in sequential mode.
>> >
>> > Looking at the code the bottleneck seems to be in
>> StreamingKMeans.clusterInternal(), without understanding the behaviour of
>> the algorithm I am not sure if the sequence of steps in there is correct.
>> >
>> >
>> > There are few calls that call themselves repeatedly over and over again
>> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
>> >
>> > We really need to have this working on datasets that are larger than 20K
>> reuters datasets.
>> >
>> > I am trying to run this on 300K vectors with k= 100, km = 1261 and
>> FastProjectSearch.
>> >
>>
>>

Re: Streaming KMeans clustering

Posted by Johannes Schulte <jo...@gmail.com>.
Hi,

I also had problems getting up to speed, but I made the cardinality of the
vectors responsible for that. I didn't do the math exactly, but while
streaming k-means improves over regular k-means by using log(k) and
(number of data points / k) passes, the dimension parameter d from the
original k*d*n stays untouched, right?

What is your vector's cardinality?
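
To make the point about d concrete, here is a back-of-envelope comparison with made-up but representative numbers (the pass count and the log(k)-searches-per-point factor are assumptions, not measurements):

// Rough cost comparison: both algorithms pay for the dimension d on every
// distance computation, so a huge d hurts both of them.
public class CostComparison {
  public static void main(String[] args) {
    long n = 300_000;          // data points (Suneel's run)
    long k = 100;              // clusters
    long d = 100_000;          // dimensionality (text-like vectors)
    long passes = 10;          // typical Lloyd iterations, assumed

    // Classic k-means: every point compared to every centroid, every pass.
    long kmeansOps = n * k * d * passes;

    // Streaming k-means sketch: roughly log2(k)-ish comparisons per point,
    // in a single pass (constants glossed over).
    long searches = Math.round(Math.log(k) / Math.log(2));
    long streamingOps = n * searches * d;

    System.out.println("k-means     ~" + kmeansOps);    // 3.0e13
    System.out.println("streaming   ~" + streamingOps); // ~2.1e11
    // Both scale linearly in d, which is why cardinality 100,000 is painful either way.
  }
}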


On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi <su...@yahoo.com>wrote:

> Ted,
>
> What were the CLI parameters when you ran this test for 1M points - no. of
> clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff?
>
>
>
>
>
>
>
> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> For reference, on a 16 core machine, I was able to run the sequential
> version of streaming k-means on 1,000,000 points, each with 10 dimensions
> in about 20 seconds.  The map-reduce versions are comparable subject to
> scaling except for startup time.
>
>
>
> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
>
> > That the algorithm runs a single reducer is expected. The algorithm
> > creates a sketch of the data in parallel in the map-phase, which is
> > collected by the reducer afterwards. The reducer then applies an
> > expensive in-memory clustering algorithm to the sketch.
> >
> > Which dataset are you using for testing? I can also do some tests on a
> > cluster here.
> >
> > I can imagine two possible causes for the problems: Maybe there's a
> > problem with the vectors and some calculations take very long because
> > the wrong access pattern or implementation is chosen.
> >
> > Another problem could be that the mappers and reducers have too few
> > memory and spend a lot of time running garbage collections.
> >
> > --sebastian
> >
> >
> > On 23.12.2013 22:14, Suneel Marthi wrote:
> > > Has anyone be successful running Streaming KMeans clustering on a large
> > dataset (> 100,000 points)?
> > >
> > >
> > > It just seems to take a very long time (> 4hrs) for the mappers to
> > finish on about 300K data points and the reduce phase has only a single
> > reducer running and throws an OOM failing the job several hours after the
> > job has been kicked off.
> > >
> > > Its the same story when trying to run in sequential mode.
> > >
> > > Looking at the code the bottleneck seems to be in
> > StreamingKMeans.clusterInternal(), without understanding the behaviour of
> > the algorithm I am not sure if the sequence of steps in there is correct.
> > >
> > >
> > > There are few calls that call themselves repeatedly over and over again
> > like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
> > >
> > > We really need to have this working on datasets that are larger than
> 20K
> > reuters datasets.
> > >
> > > I am trying to run this on 300K vectors with k= 100, km = 1261 and
> > FastProjectSearch.
> > >
> >
> >
>

Re: Streaming KMeans clustering

Posted by Suneel Marthi <su...@yahoo.com>.
Ted,

What were the CLI parameters when you ran this test for 1M points - no. of clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff? 







On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <te...@gmail.com> wrote:
 
For reference, on a 16 core machine, I was able to run the sequential
version of streaming k-means on 1,000,000 points, each with 10 dimensions
in about 20 seconds.  The map-reduce versions are comparable subject to
scaling except for startup time.



On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <ss...@apache.org> wrote:

> That the algorithm runs a single reducer is expected. The algorithm
> creates a sketch of the data in parallel in the map-phase, which is
> collected by the reducer afterwards. The reducer then applies an
> expensive in-memory clustering algorithm to the sketch.
>
> Which dataset are you using for testing? I can also do some tests on a
> cluster here.
>
> I can imagine two possible causes for the problems: Maybe there's a
> problem with the vectors and some calculations take very long because
> the wrong access pattern or implementation is chosen.
>
> Another problem could be that the mappers and reducers have too few
> memory and spend a lot of time running garbage collections.
>
> --sebastian
>
>
> On 23.12.2013 22:14, Suneel Marthi wrote:
> > Has anyone be successful running Streaming KMeans clustering on a large
> dataset (> 100,000 points)?
> >
> >
> > It just seems to take a very long time (> 4hrs) for the mappers to
> finish on about 300K data points and the reduce phase has only a single
> reducer running and throws an OOM failing the job several hours after the
> job has been kicked off.
> >
> > Its the same story when trying to run in sequential mode.
> >
> > Looking at the code the bottleneck seems to be in
> StreamingKMeans.clusterInternal(), without understanding the behaviour of
> the algorithm I am not sure if the sequence of steps in there is correct.
> >
> >
> > There are few calls that call themselves repeatedly over and over again
> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
> >
> > We really need to have this working on datasets that are larger than 20K
> reuters datasets.
> >
> > I am trying to run this on 300K vectors with k= 100, km = 1261 and
> FastProjectSearch.
> >
>
>

Re: Streaming KMeans clustering

Posted by Suneel Marthi <su...@yahoo.com>.
There is no combiner in the present implementation.  Moreover, the code path that's executed when the 'reduceStreamingKMeans' (-rskm) flag is set does not have adequate test coverage and needs to be tested more extensively. Most of the issues I had been seeing were due to specifying the -rskm flag.

Amir had provided a dataset with about 300K points; could someone try running Streaming KMeans on it - both the MapReduce and sequential versions? I have had no luck with either. Here is the link to the dataset: http://gluegadget.com/split-vectors.tar.bz2





On Thursday, December 26, 2013 3:02 PM, Ted Dunning <te...@gmail.com> wrote:
 


On Thu, Dec 26, 2013 at 10:19 AM, Suneel Marthi <su...@yahoo.com> wrote:

I heard people outside of dev@ and user@ who have tried running Streaming KMeans (from 0.8) on their Production clusters on large datasets and had seen the job crash in the Reduce phase due to OOM errors (this is with -Xmx2GB).
Excessive memory usage in reduce was a known bug that was addressed (supposedly) by using a combiner.

This really smells like bug resurrection happened somehow.  Clearly that also means that our unit tests are insufficient.

Re: Streaming KMeans clustering

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Dec 26, 2013 at 10:19 AM, Suneel Marthi <su...@yahoo.com>wrote:

> I heard people outside of dev@ and user@ who have tried running Streaming
> KMeans (from 0.8) on their Production clusters on large datasets and had
> seen the job crash in the Reduce phase due to OOM errors (this is with
> -Xmx2GB).
>

Excessive memory usage in reduce was a known bug that was addressed
(supposedly) by using a combiner.

This really smells like the bug has somehow been resurrected.  Clearly that
also means that our unit tests are insufficient.

Re: Streaming KMeans clustering

Posted by Johannes Schulte <jo...@gmail.com>.
Right. Up until now I've been helping myself with some minDf truncation, since I
am using tf-idf weighted vectors anyway and have the idf counts at hand.
But having a true loss-driven sparsification might be better.


On Sun, Dec 29, 2013 at 11:43 PM, Ted Dunning <te...@gmail.com> wrote:

> Johannes,
>
> One thing that might be of real interest is something that I haven't tried,
> nor read about.  It should still be interesting.
>
> The idea is L_1 regularized clustering.  This will give you (hopefully)
> sparse centroids that still are useful.  In your example, where you want
> defensible (aka salable) clusters, it would make the clusters much easier
> to understand.  It would also make them take less memory.
>
>
>
> On Sun, Dec 29, 2013 at 1:55 PM, Johannes Schulte <
> johannes.schulte@gmail.com> wrote:
>
> > Ted, thanks for the long response! I agree with you on the benefit of a
> lot
> > of clusters. the reason i chose 10 is not because i think there are truly
> > 10 clusters in the data (there are probably thousands, e-commerce site
> > behavioural data), but for technical reasons:
> >
> > - the cluster distances are used in a probability prediction model with
> > rare events data, so i want every cluster to contain at least some
> positive
> > examples
> > - the cluster centroids need to be kept in memory online for real time
> > feature vector generation and olr scoring. as the vectors are quite big
> and
> > there are ~100 clients to handle simultaneously (100 clients * 10
> clusters
> > * ~10000 non zero features per cluster centroid * 8 byte per vector
> > element)
> > - when "selling" the clusters, visualized, it's easier to plot something
> > with only 10 clusters
> >
> > i'll give it another try with more clusters on the sketch phase and see
> > what i can achieve. thanks for your help!
> >
> >
> > On Sun, Dec 29, 2013 at 1:32 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > >
> > > On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte <
> > > johannes.schulte@gmail.com> wrote:
> > >
> > >> Okay, understood. For a lot of clusters (which i don't necessarily
> > >> attribute to big data problems but definetly to nearest neighbour like
> > >> usage of clusters), the every "cluster sees every point" doesnt scale
> > >> well.
> > >>
> > >
> > >
> > > Nearest neighbor sorts of problems.  Or market segmentation kinds of
> > > problems.  Or feature generation kinds of problems.  Or volume
> > quantization
> > > kinds of problems.
> > >
> > > The point is that with more data, you can derive a more detailed model.
> > >  That means more clusters.
> > >
> > > The ideal case is that we have an infinite mixture model to describe
> the
> > > data distribution, but we haven't seen most of the mixture components
> > yet.
> > >  As we get more data, we can justify saying we have more components.
> > >
> > > Even if you think that there are *really* 10 clusters in the data, I
> can
> > > make the case that you are going to get a better unsupervised
> description
> > > of those clusters by using 100 or more clusters in the k-means
> algorithm.
> > >  The idea is that each of your real clusters will be composed of
> several
> > of
> > > the k-means clusters and finding the attachments is much easier because
> > 100
> > > clusters is much smaller than the original number of samples.
> > >
> > > As a small data example, suppose you are looking at Anderson's Iris
> data
> > > which is available built into R.  If we plot the 150 data points
> against
> > > the features we have in pairs, we can see various patterns and see
> that a
> > > non-linear classifier should do quite well in separating the classes (I
> > > hope the image makes it through the emailer):
> > >
> > > [image: Inline image 1]
> > >
> > > But if we try to do k-means on these data with only 3 clusters, we get
> > > very poor assignment to the three species:
> > >
> > >
> > > *> k = kmeans(iris[,1:4], centers=3, nstart=10)*
> > > *> table(iris$Species, k$cluster)*
> > >
> > >               cluster
> > >               1  2  3
> > >   setosa*     50  0  0*
> > >   versicolor*  0 48  2*
> > >   virginica*   0 14 36*
> > >
> > > Cluster 1 captured the isolated setosa species rather well, but
> > versicolor
> > > and virginica are not well separated because cluster 2 has 80% of
> > > versicolor and 20% of virginica.
> > >
> > > On the other hand, if we use 7 clusters,
> > >
> > >
> > > *> k = kmeans(iris[,1:4], centers=7, nstart=10)*
> > >
> > > *> table(iris$Species, k$cluster)*
> > >
> > >                    cluster
> > >               1  2  3  4  5  6  7
> > >   setosa*      0  0 28  0 22  0  0*
> > >   versicolor*  0  7  0 20  0  0 23*
> > >   virginica*  12  0  0  1  0 24 13*
> > >
> > > Each cluster is now composed of almost exactly one species.  Only
> cluster
> > > 4 has any impurity and it is 95% composed of just versicolor samples.
> > >
> > > What this means is that we can use the 7 cluster k-means results to
> build
> > > a classifier that has a 1 of 7 input feature (cluster id) instead of 4
> > real
> > > values.  That is, we have compressed the 4 original continuous features
> > > down to about 2.7 bits on average and this compressed representation
> > > actually makes building a classifier nearly trivial.
> > >
> > > *> -sum(table(k$cluster) * log(table(k$cluster)/ 150)/150)/log(2)*
> > > *2.670288*
> > >
> > > This is pretty cool compression given that the original continuous data
> > > had about 22 bits of raw information capacity based on the range and
> > > precision given.
> > >
> > > Now, in the real world, we would need to hold out some data for cross
> > > validation of the clustering, but the fact remains that if you want to
> > use
> > > the k-means clustering for some machine oriented purpose, it pays to
> have
> > > as many clusters as you can have, subject to the held-out data agreeing
> > > with the original data on distance distributions and counts for the
> > > clusters.
> > >
> >
>

Re: Streaming KMeans clustering

Posted by Ted Dunning <te...@gmail.com>.
Johannes,

One thing that might be of real interest is something that I haven't tried,
nor read about.  It should still be interesting.

The idea is L_1 regularized clustering.  This will give you (hopefully)
sparse centroids that still are useful.  In your example, where you want
defensible (aka salable) clusters, it would make the clusters much easier
to understand.  It would also make them take less memory.



On Sun, Dec 29, 2013 at 1:55 PM, Johannes Schulte <
johannes.schulte@gmail.com> wrote:

> Ted, thanks for the long response! I agree with you on the benefit of a lot
> of clusters. the reason i chose 10 is not because i think there are truly
> 10 clusters in the data (there are probably thousands, e-commerce site
> behavioural data), but for technical reasons:
>
> - the cluster distances are used in a probability prediction model with
> rare events data, so i want every cluster to contain at least some positive
> examples
> - the cluster centroids need to be kept in memory online for real time
> feature vector generation and olr scoring. as the vectors are quite big and
> there are ~100 clients to handle simultaneously (100 clients * 10 clusters
> * ~10000 non zero features per cluster centroid * 8 byte per vector
> element)
> - when "selling" the clusters, visualized, it's easier to plot something
> with only 10 clusters
>
> i'll give it another try with more clusters on the sketch phase and see
> what i can achieve. thanks for your help!
>
>
> On Sun, Dec 29, 2013 at 1:32 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> >
> > On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte <
> > johannes.schulte@gmail.com> wrote:
> >
> >> Okay, understood. For a lot of clusters (which i don't necessarily
> >> attribute to big data problems but definetly to nearest neighbour like
> >> usage of clusters), the every "cluster sees every point" doesnt scale
> >> well.
> >>
> >
> >
> > Nearest neighbor sorts of problems.  Or market segmentation kinds of
> > problems.  Or feature generation kinds of problems.  Or volume
> quantization
> > kinds of problems.
> >
> > The point is that with more data, you can derive a more detailed model.
> >  That means more clusters.
> >
> > The ideal case is that we have an infinite mixture model to describe the
> > data distribution, but we haven't seen most of the mixture components
> yet.
> >  As we get more data, we can justify saying we have more components.
> >
> > Even if you think that there are *really* 10 clusters in the data, I can
> > make the case that you are going to get a better unsupervised description
> > of those clusters by using 100 or more clusters in the k-means algorithm.
> >  The idea is that each of your real clusters will be composed of several
> of
> > the k-means clusters and finding the attachments is much easier because
> 100
> > clusters is much smaller than the original number of samples.
> >
> > As a small data example, suppose you are looking at Anderson's Iris data
> > which is available built into R.  If we plot the 150 data points against
> > the features we have in pairs, we can see various patterns and see that a
> > non-linear classifier should do quite well in separating the classes (I
> > hope the image makes it through the emailer):
> >
> > [image: Inline image 1]
> >
> > But if we try to do k-means on these data with only 3 clusters, we get
> > very poor assignment to the three species:
> >
> >
> > *> k = kmeans(iris[,1:4], centers=3, nstart=10)*
> > *> table(iris$Species, k$cluster)*
> >
> >               cluster
> >               1  2  3
> >   setosa*     50  0  0*
> >   versicolor*  0 48  2*
> >   virginica*   0 14 36*
> >
> > Cluster 1 captured the isolated setosa species rather well, but
> versicolor
> > and virginica are not well separated because cluster 2 has 80% of
> > versicolor and 20% of virginica.
> >
> > On the other hand, if we use 7 clusters,
> >
> >
> > *> k = kmeans(iris[,1:4], centers=7, nstart=10)*
> >
> > *> table(iris$Species, k$cluster)*
> >
> >                    cluster
> >               1  2  3  4  5  6  7
> >   setosa*      0  0 28  0 22  0  0*
> >   versicolor*  0  7  0 20  0  0 23*
> >   virginica*  12  0  0  1  0 24 13*
> >
> > Each cluster is now composed of almost exactly one species.  Only cluster
> > 4 has any impurity and it is 95% composed of just versicolor samples.
> >
> > What this means is that we can use the 7 cluster k-means results to build
> > a classifier that has a 1 of 7 input feature (cluster id) instead of 4
> real
> > values.  That is, we have compressed the 4 original continuous features
> > down to about 2.7 bits on average and this compressed representation
> > actually makes building a classifier nearly trivial.
> >
> > *> -sum(table(k$cluster) * log(table(k$cluster)/ 150)/150)/log(2)*
> > *2.670288*
> >
> > This is pretty cool compression given that the original continuous data
> > had about 22 bits of raw information capacity based on the range and
> > precision given.
> >
> > Now, in the real world, we would need to hold out some data for cross
> > validation of the clustering, but the fact remains that if you want to
> use
> > the k-means clustering for some machine oriented purpose, it pays to have
> > as many clusters as you can have, subject to the held-out data agreeing
> > with the original data on distance distributions and counts for the
> > clusters.
> >
>

Re: Streaming KMeans clustering

Posted by Johannes Schulte <jo...@gmail.com>.
Ted, thanks for the long response! I agree with you on the benefit of a lot
of clusters. The reason I chose 10 is not because I think there are truly
10 clusters in the data (there are probably thousands; it's e-commerce site
behavioural data), but for technical reasons:

- the cluster distances are used in a probability prediction model with
rare-events data, so I want every cluster to contain at least some positive
examples
- the cluster centroids need to be kept in memory online for real-time
feature vector generation and OLR scoring, and the vectors are quite big and
there are ~100 clients to handle simultaneously (100 clients * 10 clusters
* ~10000 non-zero features per cluster centroid * 8 bytes per vector
element; see the back-of-envelope sketch below)
- when "selling" the clusters, visualized, it's easier to plot something
with only 10 clusters

I'll give it another try with more clusters in the sketch phase and see
what I can achieve. Thanks for your help!
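
The back-of-envelope referenced in the second bullet, spelling out the arithmetic (only the 8-byte values are counted; index overhead is ignored):

// Memory for keeping all centroids in memory online, using the numbers from
// the bullet list above.
public class OnlineCentroidMemory {
  public static void main(String[] args) {
    long clients = 100;
    long clustersPerClient = 10;
    long nonZerosPerCentroid = 10_000;
    long bytesPerElement = 8;                     // double values only

    long bytes = clients * clustersPerClient * nonZerosPerCentroid * bytesPerElement;
    System.out.printf("~%.0f MB for values alone%n", bytes / 1e6);   // ~80 MB
    // With 100 clusters per client instead of 10, this grows to ~800 MB,
    // which is why the cluster count is a real constraint here.
  }
}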


On Sun, Dec 29, 2013 at 1:32 AM, Ted Dunning <te...@gmail.com> wrote:

>
> On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte <
> johannes.schulte@gmail.com> wrote:
>
>> Okay, understood. For a lot of clusters (which i don't necessarily
>> attribute to big data problems but definetly to nearest neighbour like
>> usage of clusters), the every "cluster sees every point" doesnt scale
>> well.
>>
>
>
> Nearest neighbor sorts of problems.  Or market segmentation kinds of
> problems.  Or feature generation kinds of problems.  Or volume quantization
> kinds of problems.
>
> The point is that with more data, you can derive a more detailed model.
>  That means more clusters.
>
> The ideal case is that we have an infinite mixture model to describe the
> data distribution, but we haven't seen most of the mixture components yet.
>  As we get more data, we can justify saying we have more components.
>
> Even if you think that there are *really* 10 clusters in the data, I can
> make the case that you are going to get a better unsupervised description
> of those clusters by using 100 or more clusters in the k-means algorithm.
>  The idea is that each of your real clusters will be composed of several of
> the k-means clusters and finding the attachments is much easier because 100
> clusters is much smaller than the original number of samples.
>
> As a small data example, suppose you are looking at Anderson's Iris data
> which is available built into R.  If we plot the 150 data points against
> the features we have in pairs, we can see various patterns and see that a
> non-linear classifier should do quite well in separating the classes (I
> hope the image makes it through the emailer):
>
> [image: Inline image 1]
>
> But if we try to do k-means on these data with only 3 clusters, we get
> very poor assignment to the three species:
>
>
> *> k = kmeans(iris[,1:4], centers=3, nstart=10)*
> *> table(iris$Species, k$cluster)*
>
>               cluster
>               1  2  3
>   setosa*     50  0  0*
>   versicolor*  0 48  2*
>   virginica*   0 14 36*
>
> Cluster 1 captured the isolated setosa species rather well, but versicolor
> and virginica are not well separated because cluster 2 has 80% of
> versicolor and 20% of virginica.
>
> On the other hand, if we use 7 clusters,
>
>
> *> k = kmeans(iris[,1:4], centers=7, nstart=10)*
>
> *> table(iris$Species, k$cluster)*
>
>                    cluster
>               1  2  3  4  5  6  7
>   setosa*      0  0 28  0 22  0  0*
>   versicolor*  0  7  0 20  0  0 23*
>   virginica*  12  0  0  1  0 24 13*
>
> Each cluster is now composed of almost exactly one species.  Only cluster
> 4 has any impurity and it is 95% composed of just versicolor samples.
>
> What this means is that we can use the 7 cluster k-means results to build
> a classifier that has a 1 of 7 input feature (cluster id) instead of 4 real
> values.  That is, we have compressed the 4 original continuous features
> down to about 2.7 bits on average and this compressed representation
> actually makes building a classifier nearly trivial.
>
> *> -sum(table(k$cluster) * log(table(k$cluster)/ 150)/150)/log(2)*
> *2.670288*
>
> This is pretty cool compression given that the original continuous data
> had about 22 bits of raw information capacity based on the range and
> precision given.
>
> Now, in the real world, we would need to hold out some data for cross
> validation of the clustering, but the fact remains that if you want to use
> the k-means clustering for some machine oriented purpose, it pays to have
> as many clusters as you can have, subject to the held-out data agreeing
> with the original data on distance distributions and counts for the
> clusters.
>

Re: Streaming KMeans clustering

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Jan 1, 2014 at 6:23 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> > On the other hand, if we use 7 clusters,
> >
> >
> > *> k = kmeans(iris[,1:4], centers=7, nstart=10)*
> >
> > *> table(iris$Species, k$cluster)*
> >
> >                    cluster
> >               1  2  3  4  5  6  7
> >   setosa*      0  0 28  0 22  0  0*
> >   versicolor*  0  7  0 20  0  0 23*
> >   virginica*  12  0  0  1  0 24 13*
> >
> > Each cluster is now composed of almost exactly one species.  Only cluster
> > 4 has any impurity and it is 95% composed of just versicolor samples.
> >
> @Ted,
>
> How about cluster 7? it seems it is not as a demonstrable improvement, or i
> don't get something
>

That is a big shock.  I don't remember seeing cluster 7 that way.  I wonder
whether I re-ran the numbers one last time before putting them into the email
(kmeans is non-deterministic).

Here is another run that shows the desired effect

> table(iris$Species, k$cluster)
>
>               1  2  3  4  5  6  7
>   setosa     28  0  0 22  0  0  0
>   versicolor  0  0  4  0  0 27 19
>   virginica   0 12 15  0 22  1  0




In looking back at my transcript of what I did, I only see the version that
I sent out earlier.  I have used this example several times so I now think
that I must have been seeing what I expected to see rather than what was
there.

In more experiments I find that with 50 restarts of k-means, the
sub-optimal solution comes up very rarely.  With 2 restarts, it comes up
much more frequently.

This non-determinism is an excellent motivation for cross validation.  And
a suggestion to the wise that you re-run your k-means algorithms many times
(which makes streaming k-means look even better since it makes up for
restarts with more clusters).

Thanks for the eagle-eyes!

Re: Streaming KMeans clustering

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
> On the other hand, if we use 7 clusters,
>
>
> *> k = kmeans(iris[,1:4], centers=7, nstart=10)*
>
> *> table(iris$Species, k$cluster)*
>
>                    cluster
>               1  2  3  4  5  6  7
>   setosa*      0  0 28  0 22  0  0*
>   versicolor*  0  7  0 20  0  0 23*
>   virginica*  12  0  0  1  0 24 13*
>
> Each cluster is now composed of almost exactly one species.  Only cluster
> 4 has any impurity and it is 95% composed of just versicolor samples.
>
@Ted,

How about cluster 7? It seems it is not as demonstrable an improvement, or I
am missing something.

Re: Streaming KMeans clustering

Posted by Ted Dunning <te...@gmail.com>.
Here is a link to the image.  Nothing very exciting.

https://dl.dropboxusercontent.com/u/36863361/iris.png




On Sat, Dec 28, 2013 at 4:32 PM, Ted Dunning <te...@gmail.com> wrote:

>
> On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte <
> johannes.schulte@gmail.com> wrote:
>
>> Okay, understood. For a lot of clusters (which i don't necessarily
>> attribute to big data problems but definetly to nearest neighbour like
>> usage of clusters), the every "cluster sees every point" doesnt scale
>> well.
>>
>
>
> Nearest neighbor sorts of problems.  Or market segmentation kinds of
> problems.  Or feature generation kinds of problems.  Or volume quantization
> kinds of problems.
>
> The point is that with more data, you can derive a more detailed model.
>  That means more clusters.
>
> The ideal case is that we have an infinite mixture model to describe the
> data distribution, but we haven't seen most of the mixture components yet.
>  As we get more data, we can justify saying we have more components.
>
> Even if you think that there are *really* 10 clusters in the data, I can
> make the case that you are going to get a better unsupervised description
> of those clusters by using 100 or more clusters in the k-means algorithm.
>  The idea is that each of your real clusters will be composed of several of
> the k-means clusters and finding the attachments is much easier because 100
> clusters is much smaller than the original number of samples.
>
> As a small data example, suppose you are looking at Anderson's Iris data
> which is available built into R.  If we plot the 150 data points against
> the features we have in pairs, we can see various patterns and see that a
> non-linear classifier should do quite well in separating the classes (I
> hope the image makes it through the emailer):
>
> [image: Inline image 1]
>
> But if we try to do k-means on these data with only 3 clusters, we get
> very poor assignment to the three species:
>
>
> *> k = kmeans(iris[,1:4], centers=3, nstart=10)*
> *> table(iris$Species, k$cluster)*
>
>               cluster
>               1  2  3
>   setosa*     50  0  0*
>   versicolor*  0 48  2*
>   virginica*   0 14 36*
>
> Cluster 1 captured the isolated setosa species rather well, but versicolor
> and virginica are not well separated because cluster 2 has 80% of
> versicolor and 20% of virginica.
>
> On the other hand, if we use 7 clusters,
>
>
> *> k = kmeans(iris[,1:4], centers=7, nstart=10)*
>
> *> table(iris$Species, k$cluster)*
>
>                    cluster
>               1  2  3  4  5  6  7
>   setosa*      0  0 28  0 22  0  0*
>   versicolor*  0  7  0 20  0  0 23*
>   virginica*  12  0  0  1  0 24 13*
>
> Each cluster is now composed of almost exactly one species.  Only cluster
> 4 has any impurity and it is 95% composed of just versicolor samples.
>
> What this means is that we can use the 7 cluster k-means results to build
> a classifier that has a 1 of 7 input feature (cluster id) instead of 4 real
> values.  That is, we have compressed the 4 original continuous features
> down to about 2.7 bits on average and this compressed representation
> actually makes building a classifier nearly trivial.
>
> *> -sum(table(k$cluster) * log(table(k$cluster)/ 150)/150)/log(2)*
> *2.670288*
>
> This is pretty cool compression given that the original continuous data
> had about 22 bits of raw information capacity based on the range and
> precision given.
>
> Now, in the real world, we would need to hold out some data for cross
> validation of the clustering, but the fact remains that if you want to use
> the k-means clustering for some machine oriented purpose, it pays to have
> as many clusters as you can have, subject to the held-out data agreeing
> with the original data on distance distributions and counts for the
> clusters.
>

Re: Streaming KMeans clustering

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte <
johannes.schulte@gmail.com> wrote:

> Okay, understood. For a lot of clusters (which i don't necessarily
> attribute to big data problems but definetly to nearest neighbour like
> usage of clusters), the every "cluster sees every point" doesnt scale well.
>


Nearest neighbor sorts of problems.  Or market segmentation kinds of
problems.  Or feature generation kinds of problems.  Or volume quantization
kinds of problems.

The point is that with more data, you can derive a more detailed model.
 That means more clusters.

The ideal case is that we have an infinite mixture model to describe the
data distribution, but we haven't seen most of the mixture components yet.
 As we get more data, we can justify saying we have more components.

Even if you think that there are *really* 10 clusters in the data, I can
make the case that you are going to get a better unsupervised description
of those clusters by using 100 or more clusters in the k-means algorithm.
 The idea is that each of your real clusters will be composed of several of
the k-means clusters and finding the attachments is much easier because 100
clusters is much smaller than the original number of samples.

As a small data example, suppose you are looking at Anderson's Iris data
which is available built into R.  If we plot the 150 data points against
the features we have in pairs, we can see various patterns and see that a
non-linear classifier should do quite well in separating the classes (I
hope the image makes it through the emailer):

[image: Inline image 1]

But if we try to do k-means on these data with only 3 clusters, we get very
poor assignment to the three species:


> k = kmeans(iris[,1:4], centers=3, nstart=10)
> table(iris$Species, k$cluster)

              cluster
              1  2  3
  setosa     50  0  0
  versicolor  0 48  2
  virginica   0 14 36

Cluster 1 captured the isolated setosa species rather well, but versicolor
and virginica are not well separated because cluster 2 has 80% of
versicolor and 20% of virginica.

On the other hand, if we use 7 clusters,


> k = kmeans(iris[,1:4], centers=7, nstart=10)

> table(iris$Species, k$cluster)

                   cluster
              1  2  3  4  5  6  7
  setosa      0  0 28  0 22  0  0
  versicolor  0  7  0 20  0  0 23
  virginica  12  0  0  1  0 24 13

Each cluster is now composed of almost exactly one species.  Only cluster 4
has any impurity and it is 95% composed of just versicolor samples.

What this means is that we can use the 7 cluster k-means results to build a
classifier that has a 1 of 7 input feature (cluster id) instead of 4 real
values.  That is, we have compressed the 4 original continuous features
down to about 2.7 bits on average and this compressed representation
actually makes building a classifier nearly trivial.

> -sum(table(k$cluster) * log(table(k$cluster)/ 150)/150)/log(2)
2.670288

This is pretty cool compression given that the original continuous data had
about 22 bits of raw information capacity based on the range and precision
given.
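
The 2.67-bit figure can be checked directly from the cluster sizes in the 7-cluster table above (column sums 12, 7, 28, 21, 22, 24, 36 out of 150 points); a quick re-computation, here in Java rather than R:

// Re-compute the average code length of the 7-cluster assignment from the
// column sums of the table above.
public class ClusterEntropy {
  public static void main(String[] args) {
    int[] sizes = {12, 7, 28, 21, 22, 24, 36};
    int n = 150;
    double bits = 0.0;
    for (int s : sizes) {
      double p = (double) s / n;
      bits -= p * Math.log(p) / Math.log(2);   // -p * log2(p)
    }
    System.out.println(bits);   // ~2.67, matching the R result above
  }
}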

Now, in the real world, we would need to hold out some data for cross
validation of the clustering, but the fact remains that if you want to use
the k-means clustering for some machine oriented purpose, it pays to have
as many clusters as you can have, subject to the held-out data agreeing
with the original data on distance distributions and counts for the
clusters.

Re: Streaming KMeans clustering

Posted by Johannes Schulte <jo...@gmail.com>.
Okay, understood. For a lot of clusters (which I don't necessarily
attribute to big-data problems, but definitely to nearest-neighbour-like
usage of clusters), the "every cluster sees every point" approach doesn't
scale well.

However, for (final) 1000 clusters I see around 200 distance measure
calculations per point, a lot better than 1000. And another thing that
makes me think is that the test runs A LOT faster than with 10 final
clusters.


On Sat, Dec 28, 2013 at 1:09 AM, Ted Dunning <te...@gmail.com> wrote:

> Of course, this is just for one pass of k-means.  If you need 4 passes, you
> have break-even.
>
> More typically for big data problems, k=1000 or some such.  Total number of
> distance computations for streaming k-means will still be about 40 (or
> adjust to the more theory motivated value of log k + log log N = 10 + 5 and
> then adjust with a bit of fudge for real world).
>
> For k-means in that case, you still have 1000 distances to compute per pass
> and multiple passes to do.  That ratio then becomes something more like
> 10,000 / 40 = 250.
>
>
>
> On Fri, Dec 27, 2013 at 12:55 PM, Johannes Schulte <
> johannes.schulte@gmail.com> wrote:
>
> > I updated the repository (with the typo)
> >
> > git@github.com:baunz/cluster-comprarison.git
> >
> > to include more logging information about the number of times the
> distance
> > measure calculation is triggered (which is the most expensive thing imo).
> > the factor of dist. measure calculations per point seen is about 40 at
> > streaming k-means and 10 for regular k-means (because there are 10
> > clusters).
> >
> > This is of course dependent on the searchSize Parameter but i used the
> > default value of 2.
> >
> >
> >
> > On Fri, Dec 27, 2013 at 6:54 PM, Isabel Drost-Fromm <isabel@apache.org
> > >wrote:
> >
> > >
> > > Hi Dan,
> > >
> > >
> > > On Fri, 27 Dec 2013 14:13:51 +0200
> > > Dan Filimon <df...@apache.org> wrote:
> > > > Thoughts?
> > >
> > > First of all - good to see you back on dev@ :)
> > >
> > > Seems a few people have run into these issues. As currently there is no
> > > high level documentation for the whole streaming kmeans implementation
> > > - would you mind writing up the limitation and advise you have for
> users
> > > of this algorithm? Doesn't need to be anything fancy - essentially a
> > > here's how you compute how much memory you need to run this, here's the
> > > limitations and the flags to deal with these, here's things that should
> > > be changed or fixed in a later iteration - unless your previous mail
> > > covers all of this already. This could safe people a few debugging
> > > cycles when getting started with this at scale.
> > >
> > > Feel free to get it into our web page (if you are short in time, just
> > > write something up using markdown, I can take over publishing it).
> > >
> > > Isabel
> > >
> >
>

Re: Streaming KMeans clustering

Posted by Ted Dunning <te...@gmail.com>.
Of course, this is just for one pass of k-means.  If you need 4 passes, you
have break-even.

More typically for big data problems, k=1000 or some such.  Total number of
distance computations for streaming k-means will still be about 40 (or
adjust to the more theory motivated value of log k + log log N = 10 + 5 and
then adjust with a bit of fudge for real world).

For k-means in that case, you still have 1000 distances to compute per pass
and multiple passes to do.  That ratio then becomes something more like
10,000 / 40 = 250.
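
A rough sanity check of those numbers, assuming base-2 logs, k = 1000,
N = 1,000,000 and 10 passes of plain k-means, is sketched below
(hypothetical code, not from Mahout):

public class RatioCheck {
  public static void main(String[] args) {
    double k = 1000;
    double n = 1000000;
    double streamingPerPoint = Math.log(k) / Math.log(2)
        + Math.log(Math.log(n) / Math.log(2)) / Math.log(2); // log k + log log n ~= 14.3
    double kmeansPerPoint = k * 10;                          // 1000 distances x 10 passes
    System.out.println(streamingPerPoint);  // ~14.3, fudged up to ~40 in practice
    System.out.println(kmeansPerPoint / 40);  // ~250
  }
}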



On Fri, Dec 27, 2013 at 12:55 PM, Johannes Schulte <
johannes.schulte@gmail.com> wrote:

> I updated the repository (with the typo)
>
> git@github.com:baunz/cluster-comprarison.git
>
> to include more logging information about the number of times the distance
> measure calculation is triggered (which is the most expensive thing imo).
> the factor of dist. measure calculations per point seen is about 40 at
> streaming k-means and 10 for regular k-means (because there are 10
> clusters).
>
> This is of course dependent on the searchSize Parameter but i used the
> default value of 2.
>
>
>
> On Fri, Dec 27, 2013 at 6:54 PM, Isabel Drost-Fromm <isabel@apache.org
> >wrote:
>
> >
> > Hi Dan,
> >
> >
> > On Fri, 27 Dec 2013 14:13:51 +0200
> > Dan Filimon <df...@apache.org> wrote:
> > > Thoughts?
> >
> > First of all - good to see you back on dev@ :)
> >
> > Seems a few people have run into these issues. As currently there is no
> > high level documentation for the whole streaming kmeans implementation
> > - would you mind writing up the limitation and advise you have for users
> > of this algorithm? Doesn't need to be anything fancy - essentially a
> > here's how you compute how much memory you need to run this, here's the
> > limitations and the flags to deal with these, here's things that should
> > be changed or fixed in a later iteration - unless your previous mail
> > covers all of this already. This could safe people a few debugging
> > cycles when getting started with this at scale.
> >
> > Feel free to get it into our web page (if you are short in time, just
> > write something up using markdown, I can take over publishing it).
> >
> > Isabel
> >
>

Re: Streaming KMeans clustering

Posted by Johannes Schulte <jo...@gmail.com>.
I updated the repository (with the typo)

git@github.com:baunz/cluster-comprarison.git

to include more logging information about the number of times the distance
measure calculation is triggered (which is the most expensive thing imo).
The factor of distance measure calculations per point seen is about 40 for
streaming k-means and 10 for regular k-means (because there are 10
clusters).

This is of course dependent on the searchSize parameter, but I used the
default value of 2.
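
A lightweight way to collect such a count is a thin counting subclass of
the distance measure; the class below is a hypothetical sketch, not part
of the linked repository, and assumes Mahout's CosineDistanceMeasure is
the measure in use and that its distance(Vector, Vector) method can be
overridden.

import java.util.concurrent.atomic.AtomicLong;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.Vector;

public class CountingCosineDistanceMeasure extends CosineDistanceMeasure {
  public static final AtomicLong CALLS = new AtomicLong();

  @Override
  public double distance(Vector v1, Vector v2) {
    CALLS.incrementAndGet();        // count every full distance computation
    return super.distance(v1, v2);  // (the three-argument overload is not counted here)
  }
}

Dividing CALLS.get() by the number of points fed to the clusterer gives
the per-point factor quoted above.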



On Fri, Dec 27, 2013 at 6:54 PM, Isabel Drost-Fromm <is...@apache.org>wrote:

>
> Hi Dan,
>
>
> On Fri, 27 Dec 2013 14:13:51 +0200
> Dan Filimon <df...@apache.org> wrote:
> > Thoughts?
>
> First of all - good to see you back on dev@ :)
>
> Seems a few people have run into these issues. As currently there is no
> high level documentation for the whole streaming kmeans implementation
> - would you mind writing up the limitation and advise you have for users
> of this algorithm? Doesn't need to be anything fancy - essentially a
> here's how you compute how much memory you need to run this, here's the
> limitations and the flags to deal with these, here's things that should
> be changed or fixed in a later iteration - unless your previous mail
> covers all of this already. This could safe people a few debugging
> cycles when getting started with this at scale.
>
> Feel free to get it into our web page (if you are short in time, just
> write something up using markdown, I can take over publishing it).
>
> Isabel
>

Re: Streaming KMeans clustering

Posted by Isabel Drost-Fromm <is...@apache.org>.
Hi Dan,


On Fri, 27 Dec 2013 14:13:51 +0200
Dan Filimon <df...@apache.org> wrote:
> Thoughts?

First of all - good to see you back on dev@ :)

Seems a few people have run into these issues. As currently there is no
high level documentation for the whole streaming kmeans implementation
- would you mind writing up the limitations and advice you have for users
of this algorithm? Doesn't need to be anything fancy - essentially a
here's how you compute how much memory you need to run this, here's the
limitations and the flags to deal with these, here's things that should
be changed or fixed in a later iteration - unless your previous mail
covers all of this already. This could save people a few debugging
cycles when getting started with this at scale.

Feel free to get it into our web page (if you are short in time, just
write something up using markdown, I can take over publishing it).

Isabel

Re: Streaming KMeans clustering

Posted by Dan Filimon <df...@apache.org>.
Hi everyone!

So for the two issues:

1. Mapper slowness: this is basically an issue with the searcher being
used. The default is ProjectionSearch which was doing a good job. If the
bottleneck is indeed remove or searchFirst, that sort of points out a
limitation in the basic algorithm (unless it turns out there's something
super dumb going on).

2. Reducer OOM: for this job, if we have m mappers, clustering n points
into k clusters, each mapper should get roughly n / m points to cluster,
and produce k log(n / m) centroids. The total number of points that the
reducer gets is m * k * log(n / m).

As you can see, this means that this really depends on the particular data
set we're working with. Suppose k is n / 10 and you have m = 10 mappers.
That gets you 10 * (n / 10) * log(n / 10) ~ n log n points that the reducer
has to cluster, which really makes this approach totally useless because
you'll have more points at the end than at the beginning.
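
A rough back-of-the-envelope helper for that m * k * log(n / m) estimate is
sketched below; it is not from the Mahout code base, and the mapper count
and centroid dimensionality are assumptions you would replace with your own
job's values.

public class ReducerLoadEstimate {
  public static void main(String[] args) {
    long n = 300000;    // number of points (the corpus size mentioned in this thread)
    long k = 1000;      // desired number of clusters
    long m = 10;        // number of mappers (assumption)
    int dim = 100000;   // centroid dimensionality if dense (assumption)

    double centroids = m * k * Math.log((double) n / m);  // m * k * log(n / m)
    double bytes = centroids * dim * 8.0;                  // 8 bytes per double, no object overhead
    System.out.printf("~%.0f centroids, ~%.1f GB if fully dense%n",
        centroids, bytes / (1L << 30));
  }
}

With the 300K-point, k = 1000 run discussed in this thread and those
assumed values, this prints roughly 100,000 intermediate centroids and on
the order of 75-80 GB, so it is easy to see how a reducer with a small heap
falls over; centroids built from sparse text vectors may not end up fully
dense, so treat the memory figure as an upper bound.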

In any case, if the number of reducer centroids (the m * k * log (n / m))
is acceptable, there's an option to run another StreamingKMeans in the
reducer: there's the reduceStreamingKMeans flag in the driver.

However, I feel that if you see yourself needing this flag, it probably
shows that this MapReduce approach is not what you want and you should just
run StreamingKMeans directly.

I think in retrospect, that there should be code that checks for this in
the driver and spits out a warning. :)

Thoughts?

(Happy Holidays to everyone too! :D)



On Fri, Dec 27, 2013 at 9:59 AM, Sotiris Salloumis <in...@eprice.gr> wrote:

> Hi Suneel,
>
> Is it possible to upload debug or log messages from the OOM exceptions you
> have seen to take a look on them?
>
> Regards
> Sotiris
>
>
> On Thu, Dec 26, 2013 at 8:19 PM, Suneel Marthi <suneel_marthi@yahoo.com
> >wrote:
>
> > I would push the code freeze until this is resolved (and the reason I had
> > been holding off). This is something that should have been raised for 0.8
> > release and I dob;t think we should defer this to the next one.
> >
> > I heard people outside of dev@ and user@ who have tried running
> Streaming
> > KMeans (from 0.8) on their Production clusters on large datasets and had
> > seen the job crash in the Reduce phase due to OOM errors (this is with
> > -Xmx2GB).
> >
> >
> >
> >
> >
> >
> > On Thursday, December 26, 2013 12:53 PM, Isabel Drost-Fromm <
> > isabel@apache.org> wrote:
> >
> > On Thu, Dec 26, 2013 at 12:28:18AM -0800, Suneel Marthi wrote:
> >
> > > Its when you increase the no. of documents and the size of each
> > >  document (add more dimensions) that you start seeing performance
> issues
> > which are:
> > > a)The Mappers take long to complete and its either the
> searcher.remove()
> > or searcher.searchFirst() calls (will check again in my next attempt)
> that
> > seems to be the bottleneck.
> > > b) Once the Mappers complete (after several hours) the Reducer dies
> with
> > an OOM exception (despite having set -Xmx2G).
> >
> > Given that there seem to be a couple of people experiencing issues I
> think
> > it makes sense to create a JIRA issue here to track progress - either
> code
> > improvements or better documentation on how to run this implementation.
> >
> > @Suneel: Does it make sense to push code freeze to after fixing this or
> > should this be communicated as a known defect in the release notes?
> >
> >
> > Isabel
>

Re: Streaming KMeans clustering

Posted by Isabel Drost-Fromm <is...@apache.org>.
On Fri, 27 Dec 2013 09:59:05 +0200
Sotiris Salloumis <in...@eprice.gr> wrote:
> Is it possible to upload debug or log messages from the OOM
> exceptions you have seen to take a look on them?

Preferably to a JIRA issue related to the issues seen.


Isabel

Re: Streaming KMeans clustering

Posted by Sotiris Salloumis <in...@eprice.gr>.
Hi Suneel,

Is it possible to upload debug or log messages from the OOM exceptions you
have seen to take a look on them?

Regards
Sotiris


On Thu, Dec 26, 2013 at 8:19 PM, Suneel Marthi <su...@yahoo.com>wrote:

> I would push the code freeze until this is resolved (and the reason I had
> been holding off). This is something that should have been raised for 0.8
> release and I dob;t think we should defer this to the next one.
>
> I heard people outside of dev@ and user@ who have tried running Streaming
> KMeans (from 0.8) on their Production clusters on large datasets and had
> seen the job crash in the Reduce phase due to OOM errors (this is with
> -Xmx2GB).
>
>
>
>
>
>
> On Thursday, December 26, 2013 12:53 PM, Isabel Drost-Fromm <
> isabel@apache.org> wrote:
>
> On Thu, Dec 26, 2013 at 12:28:18AM -0800, Suneel Marthi wrote:
>
> > Its when you increase the no. of documents and the size of each
> >  document (add more dimensions) that you start seeing performance issues
> which are:
> > a)The Mappers take long to complete and its either the searcher.remove()
> or searcher.searchFirst() calls (will check again in my next attempt) that
> seems to be the bottleneck.
> > b) Once the Mappers complete (after several hours) the Reducer dies with
> an OOM exception (despite having set -Xmx2G).
>
> Given that there seem to be a couple of people experiencing issues I think
> it makes sense to create a JIRA issue here to track progress - either code
> improvements or better documentation on how to run this implementation.
>
> @Suneel: Does it make sense to push code freeze to after fixing this or
> should this be communicated as a known defect in the release notes?
>
>
> Isabel

Re: Streaming KMeans clustering

Posted by Suneel Marthi <su...@yahoo.com>.
I would push the code freeze back until this is resolved (and that is the reason I had been holding off). This is something that should have been raised for the 0.8 release, and I don't think we should defer it to the next one.

I have heard from people outside of dev@ and user@ who have tried running Streaming KMeans (from 0.8) on their production clusters on large datasets and have seen the job crash in the Reduce phase due to OOM errors (this is with -Xmx2GB).






On Thursday, December 26, 2013 12:53 PM, Isabel Drost-Fromm <is...@apache.org> wrote:
 
On Thu, Dec 26, 2013 at 12:28:18AM -0800, Suneel Marthi wrote:

> Its when you increase the no. of documents and the size of each
>  document (add more dimensions) that you start seeing performance issues which are:
> a)The Mappers take long to complete and its either the searcher.remove() or searcher.searchFirst() calls (will check again in my next attempt) that seems to be the bottleneck.
> b) Once the Mappers complete (after several hours) the Reducer dies with an OOM exception (despite having set -Xmx2G).

Given that there seem to be a couple of people experiencing issues I think it makes sense to create a JIRA issue here to track progress - either code improvements or better documentation on how to run this implementation.

@Suneel: Does it make sense to push code freeze to after fixing this or should this be communicated as a known defect in the release notes?


Isabel

Re: Streaming KMeans clustering

Posted by Isabel Drost-Fromm <is...@apache.org>.
On Thu, Dec 26, 2013 at 12:28:18AM -0800, Suneel Marthi wrote:
> Its when you increase the no. of documents and the size of each
>  document (add more dimensions) that you start seeing performance issues which are:
> a)The Mappers take long to complete and its either the searcher.remove() or searcher.searchFirst() calls (will check again in my next attempt) that seems to be the bottleneck.
> b) Once the Mappers complete (after several hours) the Reducer dies with an OOM exception (despite having set -Xmx2G).

Given that there seem to be a couple of people experiencing issues I think it makes sense to create a JIRA issue here to track progress - either code improvements or better documentation on how to run this implementation.

@Suneel: Does it make sense to push code freeze to after fixing this or should this be communicated as a known defect in the release notes?


Isabel


Re: Streaming KMeans clustering

Posted by Suneel Marthi <su...@yahoo.com>.
Streaming KMeans is now part of examples/cluster-reuters.sh in the trunk (for 0.9 release).
  
While it runs successfully (both sequential and MapReduce versions), it's only clustering a collection of about 21K small documents (from the Reuters-21578 corpus), and the values I had used were k = 10, km = 100 and rskm = true, with -sc FastProjectionSearch -dm CosineDistanceMeasure.

It's when you increase the no. of documents and the size of each document (add more dimensions) that you start seeing performance issues, which are:
a) The Mappers take long to complete, and it's either the searcher.remove() or searcher.searchFirst() calls (will check again in my next attempt) that seem to be the bottleneck.
b) Once the Mappers complete (after several hours) the Reducer dies with an OOM exception (despite having set -Xmx2G).

BTW, I was using a 300K document corpus that Amir had provided and is available at http://gluegadget.com/split-vectors.tar.bz2 and the values were k = 1000, km = 12610, sc = FastProjectionSearch. 

Each mapper was taking about 45 minutes to complete (with the aforementioned CLI args) and my Reducer just failed when it timed out.

Amir had reported that the MapReduce version completed successfully after 16 hrs. He was running on a single-node, pseudo-distributed Hadoop and had to set "mapred.reduce.child.java.opts" to "-Xmx10240M" and "mapred.task.timeout" to "360000000" (100 hours!) for this to complete successfully.
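
For a per-job override rather than a cluster-wide change, those two
properties can normally be passed as generic Hadoop options on the command
line, assuming the driver is run through Hadoop's ToolRunner (as most
Mahout drivers are); the last line below is a placeholder for the usual
input/output and clustering options:

mahout streamingkmeans \
  -Dmapred.reduce.child.java.opts=-Xmx10240M \
  -Dmapred.task.timeout=360000000 \
  <input/output and clustering options as before>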

If I increase the number of clusters to k = 200,000 then the mappers finish really quickly (about 2-3 mins each), but the Reducer again fails with either a timeout or an OOM exception. I did not set my 'mapred.task.timeout' to be 100 hours (that's definitely not gonna cut it in a corporate environment).











On Thursday, December 26, 2013 2:58 AM, Johannes Schulte <jo...@gmail.com> wrote:
 
To be honest, i always cancelled the sketching after a while because i
wasn't satisfied with the points per second speed. The version used is the
0.8 release.

if i find the time i'm gonna look what is called when and where and how
often and what the problem could be.



On Thu, Dec 26, 2013 at 8:22 AM, Ted Dunning <te...@gmail.com> wrote:

> Interesting.  In Dan's tests on sparse data, he got about 10x speedup net.
>
> You didn't run multiple sketching passes did you?
>
>
> Also, which version?  There was a horrendous clone in there at one time.
>
>
>
>
> On Wed, Dec 25, 2013 at 2:07 PM, Johannes Schulte <
> johannes.schulte@gmail.com> wrote:
>
> > everybody should have the right to do
> >
> > job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G");
> >
> > for that :)
> >
> >
> > For my problems, i always felt the sketching took too long. i put up a
> > simple comparison here:
> >
> > git@github.com:baunz/cluster-comprarison.git
> >
> > it generates some sample vectors and clusters them with regular k-means,
> > and streaming k-means, both sequentially. i took 10 kmeans iterations as
> a
> > benchmark and used the default values for FastProjectionSearch from the
> > kMeans Driver Class.
> >
> > Visual VM tells me the most time is spent in
> FastProjectionSearch.remove().
> > This is called on every added datapoint.
> >
> > Maybe i got something wrong but for this sparse, high dimensional
> vectors i
> > never got streaming k-means faster than the regula version
> >
> >
> >
> >
> > On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi <suneel_marthi@yahoo.com
> > >wrote:
> >
> > > Not sure how that would work in a corporate setting wherein there's a
> > > fixed systemwide setting that cannot be overridden.
> > >
> > > Sent from my iPhone
> > >
> > > > On Dec 25, 2013, at 9:44 AM, Sebastian   Schelter <ss...@apache.org>
> > > wrote:
> > > >
> > > >> On 25.12.2013 14:19, Suneel Marthi wrote:
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <
> > > ted.dunning@gmail.com> wrote:
> > > >>
> > > >>>> For reference, on a 16 core machine, I was able to run the
> > sequential
> > > >>>> version of streaming k-means on 1,000,000 points, each with 10
> > > dimensions
> > > >>>> in about 20 seconds.  The map-reduce versions are comparable
> subject
> > > to
> > > >>>> scaling except for startup time.
> > > >>
> > > >> @Ted, were u working off the Streaming KMeans impl as in Mahout 0.8.
> > > Not sure how this would have even worked for u in sequential mode in
> > light
> > > of the issues reported against M-1314, M-1358, M-1380 (all of which
> > impact
> > > the sequential mode); unless u had fixed them locally.
> > > >> What were ur estimatedDistanceCutoff, number of clusters 'k',
> > > projection search and how much memory did u have to allocate to the
> > single
> > > Reducer?
> > > >
> > > > If I read the source code correctly, the final reducer clusters the
> > > > sketch which should contain m * k * log n intermediate centroids,
> where
> > > > k is the number of desired clusters, m is the number of mappers run
> and
> > > > n is the number of datapoints. Those centroids are expected to be
> > dense,
> > > > so we can estimate the memory required for the final reducer using
> this
> > > > formula.
> > > >
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <
> ssc@apache.org>
> > > wrote:
> > > >>>
> > > >>> That the algorithm runs a single reducer is expected. The algorithm
> > > >>> creates a sketch of
> > > >> the data in parallel in the map-phase, which is
> > > >>> collected by the reducer afterwards. The reducer then applies an
> > > >>> expensive in-memory clustering algorithm to the sketch.
> > > >>>
> > > >>> Which dataset are you using for testing? I can also do some tests
> on
> > a
> > > >>> cluster here.
> > > >>>
> > > >>> I can imagine two possible causes for the problems: Maybe there's a
> > > >>> problem with the vectors and some calculations take very long
> because
> > > >>> the wrong access pattern or implementation is chosen.
> > > >>>
> > > >>> Another problem could be that the mappers and reducers have too few
> > > >>> memory and spend a lot of time running garbage collections.
> > > >>>
> > > >>> --sebastian
> > > >>>
> > > >>>
> > > >>> On 23.12.2013 22:14,
> > > >> Suneel Marthi wrote:
> > > >>>> Has anyone be successful running Streaming KMeans clustering on a
> > > large
> > > >>> dataset (> 100,000 points)?
> > > >>>>
> > > >>>>
> > > >>>> It just seems to take a very long time (> 4hrs) for the mappers to
> > > >>> finish on about 300K data points and the reduce phase has only a
> > single
> > > >>> reducer running and throws an OOM failing the job several hours
> after
> > > the
> > > >>> job has been kicked off.
> > > >>>>
> > > >>>> Its the same story when trying to run in sequential mode.
> > > >>>>
> > > >>>> Looking at the code the bottleneck seems to be in
> > > >>> StreamingKMeans.clusterInternal(), without understanding the
> > behaviour
> > > of
> > > >>> the algorithm I am not sure if the sequence of steps in there is
> > > correct.
> > > >>>>
> > > >>>>
> > > >>>> There are few calls that call themselves repeatedly over and over
> > > again
> > > >>> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
> > > >>>>
> > > >>>> We really need to have this working on datasets that are larger
> than
> > > 20K
> > > >>> reuters datasets.
> > > >>>>
> > > >>>> I am trying to run this on 300K vectors with k= 100, km = 1261 and
> > > >>> FastProjectSearch.
> > > >
> > >
> >
>

Re: Streaming KMeans clustering

Posted by Johannes Schulte <jo...@gmail.com>.
To be honest, I always cancelled the sketching after a while because I
wasn't satisfied with the points-per-second speed. The version used is the
0.8 release.

If I find the time I'm going to look at what is called when, where and how
often, and what the problem could be.


On Thu, Dec 26, 2013 at 8:22 AM, Ted Dunning <te...@gmail.com> wrote:

> Interesting.  In Dan's tests on sparse data, he got about 10x speedup net.
>
> You didn't run multiple sketching passes did you?
>
>
> Also, which version?  There was a horrendous clone in there at one time.
>
>
>
>
> On Wed, Dec 25, 2013 at 2:07 PM, Johannes Schulte <
> johannes.schulte@gmail.com> wrote:
>
> > everybody should have the right to do
> >
> > job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G");
> >
> > for that :)
> >
> >
> > For my problems, i always felt the sketching took too long. i put up a
> > simple comparison here:
> >
> > git@github.com:baunz/cluster-comprarison.git
> >
> > it generates some sample vectors and clusters them with regular k-means,
> > and streaming k-means, both sequentially. i took 10 kmeans iterations as
> a
> > benchmark and used the default values for FastProjectionSearch from the
> > kMeans Driver Class.
> >
> > Visual VM tells me the most time is spent in
> FastProjectionSearch.remove().
> > This is called on every added datapoint.
> >
> > Maybe i got something wrong but for this sparse, high dimensional
> vectors i
> > never got streaming k-means faster than the regula version
> >
> >
> >
> >
> > On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi <suneel_marthi@yahoo.com
> > >wrote:
> >
> > > Not sure how that would work in a corporate setting wherein there's a
> > > fixed systemwide setting that cannot be overridden.
> > >
> > > Sent from my iPhone
> > >
> > > > On Dec 25, 2013, at 9:44 AM, Sebastian   Schelter <ss...@apache.org>
> > > wrote:
> > > >
> > > >> On 25.12.2013 14:19, Suneel Marthi wrote:
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <
> > > ted.dunning@gmail.com> wrote:
> > > >>
> > > >>>> For reference, on a 16 core machine, I was able to run the
> > sequential
> > > >>>> version of streaming k-means on 1,000,000 points, each with 10
> > > dimensions
> > > >>>> in about 20 seconds.  The map-reduce versions are comparable
> subject
> > > to
> > > >>>> scaling except for startup time.
> > > >>
> > > >> @Ted, were u working off the Streaming KMeans impl as in Mahout 0.8.
> > > Not sure how this would have even worked for u in sequential mode in
> > light
> > > of the issues reported against M-1314, M-1358, M-1380 (all of which
> > impact
> > > the sequential mode); unless u had fixed them locally.
> > > >> What were ur estimatedDistanceCutoff, number of clusters 'k',
> > > projection search and how much memory did u have to allocate to the
> > single
> > > Reducer?
> > > >
> > > > If I read the source code correctly, the final reducer clusters the
> > > > sketch which should contain m * k * log n intermediate centroids,
> where
> > > > k is the number of desired clusters, m is the number of mappers run
> and
> > > > n is the number of datapoints. Those centroids are expected to be
> > dense,
> > > > so we can estimate the memory required for the final reducer using
> this
> > > > formula.
> > > >
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <
> ssc@apache.org>
> > > wrote:
> > > >>>
> > > >>> That the algorithm runs a single reducer is expected. The algorithm
> > > >>> creates a sketch of
> > > >> the data in parallel in the map-phase, which is
> > > >>> collected by the reducer afterwards. The reducer then applies an
> > > >>> expensive in-memory clustering algorithm to the sketch.
> > > >>>
> > > >>> Which dataset are you using for testing? I can also do some tests
> on
> > a
> > > >>> cluster here.
> > > >>>
> > > >>> I can imagine two possible causes for the problems: Maybe there's a
> > > >>> problem with the vectors and some calculations take very long
> because
> > > >>> the wrong access pattern or implementation is chosen.
> > > >>>
> > > >>> Another problem could be that the mappers and reducers have too few
> > > >>> memory and spend a lot of time running garbage collections.
> > > >>>
> > > >>> --sebastian
> > > >>>
> > > >>>
> > > >>> On 23.12.2013 22:14,
> > > >> Suneel Marthi wrote:
> > > >>>> Has anyone be successful running Streaming KMeans clustering on a
> > > large
> > > >>> dataset (> 100,000 points)?
> > > >>>>
> > > >>>>
> > > >>>> It just seems to take a very long time (> 4hrs) for the mappers to
> > > >>> finish on about 300K data points and the reduce phase has only a
> > single
> > > >>> reducer running and throws an OOM failing the job several hours
> after
> > > the
> > > >>> job has been kicked off.
> > > >>>>
> > > >>>> Its the same story when trying to run in sequential mode.
> > > >>>>
> > > >>>> Looking at the code the bottleneck seems to be in
> > > >>> StreamingKMeans.clusterInternal(), without understanding the
> > behaviour
> > > of
> > > >>> the algorithm I am not sure if the sequence of steps in there is
> > > correct.
> > > >>>>
> > > >>>>
> > > >>>> There are few calls that call themselves repeatedly over and over
> > > again
> > > >>> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
> > > >>>>
> > > >>>> We really need to have this working on datasets that are larger
> than
> > > 20K
> > > >>> reuters datasets.
> > > >>>>
> > > >>>> I am trying to run this on 300K vectors with k= 100, km = 1261 and
> > > >>> FastProjectSearch.
> > > >
> > >
> >
>

Re: Streaming KMeans clustering

Posted by Ted Dunning <te...@gmail.com>.
Interesting.  In Dan's tests on sparse data, he got about 10x speedup net.

You didn't run multiple sketching passes did you?


Also, which version?  There was a horrendous clone in there at one time.




On Wed, Dec 25, 2013 at 2:07 PM, Johannes Schulte <
johannes.schulte@gmail.com> wrote:

> everybody should have the right to do
>
> job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G");
>
> for that :)
>
>
> For my problems, i always felt the sketching took too long. i put up a
> simple comparison here:
>
> git@github.com:baunz/cluster-comprarison.git
>
> it generates some sample vectors and clusters them with regular k-means,
> and streaming k-means, both sequentially. i took 10 kmeans iterations as a
> benchmark and used the default values for FastProjectionSearch from the
> kMeans Driver Class.
>
> Visual VM tells me the most time is spent in FastProjectionSearch.remove().
> This is called on every added datapoint.
>
> Maybe i got something wrong but for this sparse, high dimensional vectors i
> never got streaming k-means faster than the regula version
>
>
>
>
> On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi <suneel_marthi@yahoo.com
> >wrote:
>
> > Not sure how that would work in a corporate setting wherein there's a
> > fixed systemwide setting that cannot be overridden.
> >
> > Sent from my iPhone
> >
> > > On Dec 25, 2013, at 9:44 AM, Sebastian   Schelter <ss...@apache.org>
> > wrote:
> > >
> > >> On 25.12.2013 14:19, Suneel Marthi wrote:
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <
> > ted.dunning@gmail.com> wrote:
> > >>
> > >>>> For reference, on a 16 core machine, I was able to run the
> sequential
> > >>>> version of streaming k-means on 1,000,000 points, each with 10
> > dimensions
> > >>>> in about 20 seconds.  The map-reduce versions are comparable subject
> > to
> > >>>> scaling except for startup time.
> > >>
> > >> @Ted, were u working off the Streaming KMeans impl as in Mahout 0.8.
> > Not sure how this would have even worked for u in sequential mode in
> light
> > of the issues reported against M-1314, M-1358, M-1380 (all of which
> impact
> > the sequential mode); unless u had fixed them locally.
> > >> What were ur estimatedDistanceCutoff, number of clusters 'k',
> > projection search and how much memory did u have to allocate to the
> single
> > Reducer?
> > >
> > > If I read the source code correctly, the final reducer clusters the
> > > sketch which should contain m * k * log n intermediate centroids, where
> > > k is the number of desired clusters, m is the number of mappers run and
> > > n is the number of datapoints. Those centroids are expected to be
> dense,
> > > so we can estimate the memory required for the final reducer using this
> > > formula.
> > >
> > >>
> > >>
> > >>
> > >>
> > >>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <ss...@apache.org>
> > wrote:
> > >>>
> > >>> That the algorithm runs a single reducer is expected. The algorithm
> > >>> creates a sketch of
> > >> the data in parallel in the map-phase, which is
> > >>> collected by the reducer afterwards. The reducer then applies an
> > >>> expensive in-memory clustering algorithm to the sketch.
> > >>>
> > >>> Which dataset are you using for testing? I can also do some tests on
> a
> > >>> cluster here.
> > >>>
> > >>> I can imagine two possible causes for the problems: Maybe there's a
> > >>> problem with the vectors and some calculations take very long because
> > >>> the wrong access pattern or implementation is chosen.
> > >>>
> > >>> Another problem could be that the mappers and reducers have too few
> > >>> memory and spend a lot of time running garbage collections.
> > >>>
> > >>> --sebastian
> > >>>
> > >>>
> > >>> On 23.12.2013 22:14,
> > >> Suneel Marthi wrote:
> > >>>> Has anyone be successful running Streaming KMeans clustering on a
> > large
> > >>> dataset (> 100,000 points)?
> > >>>>
> > >>>>
> > >>>> It just seems to take a very long time (> 4hrs) for the mappers to
> > >>> finish on about 300K data points and the reduce phase has only a
> single
> > >>> reducer running and throws an OOM failing the job several hours after
> > the
> > >>> job has been kicked off.
> > >>>>
> > >>>> Its the same story when trying to run in sequential mode.
> > >>>>
> > >>>> Looking at the code the bottleneck seems to be in
> > >>> StreamingKMeans.clusterInternal(), without understanding the
> behaviour
> > of
> > >>> the algorithm I am not sure if the sequence of steps in there is
> > correct.
> > >>>>
> > >>>>
> > >>>> There are few calls that call themselves repeatedly over and over
> > again
> > >>> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
> > >>>>
> > >>>> We really need to have this working on datasets that are larger than
> > 20K
> > >>> reuters datasets.
> > >>>>
> > >>>> I am trying to run this on 300K vectors with k= 100, km = 1261 and
> > >>> FastProjectSearch.
> > >
> >
>

Re: Streaming KMeans clustering

Posted by Johannes Schulte <jo...@gmail.com>.
everybody should have the right to do

job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G");

for that :)


For my problems, I always felt the sketching took too long. I put up a
simple comparison here:

git@github.com:baunz/cluster-comprarison.git

It generates some sample vectors and clusters them with regular k-means
and streaming k-means, both sequentially. I took 10 k-means iterations as a
benchmark and used the default values for FastProjectionSearch from the
kMeans driver class.

VisualVM tells me the most time is spent in FastProjectionSearch.remove().
This is called on every added datapoint.

Maybe I got something wrong, but for these sparse, high-dimensional vectors I
never got streaming k-means faster than the regular version.




On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi <su...@yahoo.com>wrote:

> Not sure how that would work in a corporate setting wherein there's a
> fixed systemwide setting that cannot be overridden.
>
> Sent from my iPhone
>
> > On Dec 25, 2013, at 9:44 AM, Sebastian   Schelter <ss...@apache.org>
> wrote:
> >
> >> On 25.12.2013 14:19, Suneel Marthi wrote:
> >>
> >>
> >>
> >>
> >>
> >>>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <
> ted.dunning@gmail.com> wrote:
> >>
> >>>> For reference, on a 16 core machine, I was able to run the sequential
> >>>> version of streaming k-means on 1,000,000 points, each with 10
> dimensions
> >>>> in about 20 seconds.  The map-reduce versions are comparable subject
> to
> >>>> scaling except for startup time.
> >>
> >> @Ted, were u working off the Streaming KMeans impl as in Mahout 0.8.
> Not sure how this would have even worked for u in sequential mode in light
> of the issues reported against M-1314, M-1358, M-1380 (all of which impact
> the sequential mode); unless u had fixed them locally.
> >> What were ur estimatedDistanceCutoff, number of clusters 'k',
> projection search and how much memory did u have to allocate to the single
> Reducer?
> >
> > If I read the source code correctly, the final reducer clusters the
> > sketch which should contain m * k * log n intermediate centroids, where
> > k is the number of desired clusters, m is the number of mappers run and
> > n is the number of datapoints. Those centroids are expected to be dense,
> > so we can estimate the memory required for the final reducer using this
> > formula.
> >
> >>
> >>
> >>
> >>
> >>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
> >>>
> >>> That the algorithm runs a single reducer is expected. The algorithm
> >>> creates a sketch of
> >> the data in parallel in the map-phase, which is
> >>> collected by the reducer afterwards. The reducer then applies an
> >>> expensive in-memory clustering algorithm to the sketch.
> >>>
> >>> Which dataset are you using for testing? I can also do some tests on a
> >>> cluster here.
> >>>
> >>> I can imagine two possible causes for the problems: Maybe there's a
> >>> problem with the vectors and some calculations take very long because
> >>> the wrong access pattern or implementation is chosen.
> >>>
> >>> Another problem could be that the mappers and reducers have too few
> >>> memory and spend a lot of time running garbage collections.
> >>>
> >>> --sebastian
> >>>
> >>>
> >>> On 23.12.2013 22:14,
> >> Suneel Marthi wrote:
> >>>> Has anyone be successful running Streaming KMeans clustering on a
> large
> >>> dataset (> 100,000 points)?
> >>>>
> >>>>
> >>>> It just seems to take a very long time (> 4hrs) for the mappers to
> >>> finish on about 300K data points and the reduce phase has only a single
> >>> reducer running and throws an OOM failing the job several hours after
> the
> >>> job has been kicked off.
> >>>>
> >>>> Its the same story when trying to run in sequential mode.
> >>>>
> >>>> Looking at the code the bottleneck seems to be in
> >>> StreamingKMeans.clusterInternal(), without understanding the behaviour
> of
> >>> the algorithm I am not sure if the sequence of steps in there is
> correct.
> >>>>
> >>>>
> >>>> There are few calls that call themselves repeatedly over and over
> again
> >>> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
> >>>>
> >>>> We really need to have this working on datasets that are larger than
> 20K
> >>> reuters datasets.
> >>>>
> >>>> I am trying to run this on 300K vectors with k= 100, km = 1261 and
> >>> FastProjectSearch.
> >
>

Re: Streaming KMeans clustering

Posted by Suneel Marthi <su...@yahoo.com>.
Not sure how that would work in a corporate setting wherein there's a fixed systemwide setting that cannot be overridden. 

Sent from my iPhone

> On Dec 25, 2013, at 9:44 AM, Sebastian   Schelter <ss...@apache.org> wrote:
> 
>> On 25.12.2013 14:19, Suneel Marthi wrote:
>> 
>> 
>> 
>> 
>> 
>>>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <te...@gmail.com> wrote:
>> 
>>>> For reference, on a 16 core machine, I was able to run the sequential
>>>> version of streaming k-means on 1,000,000 points, each with 10 dimensions
>>>> in about 20 seconds.  The map-reduce versions are comparable subject to
>>>> scaling except for startup time.
>> 
>> @Ted, were u working off the Streaming KMeans impl as in Mahout 0.8. Not sure how this would have even worked for u in sequential mode in light of the issues reported against M-1314, M-1358, M-1380 (all of which impact the sequential mode); unless u had fixed them locally.
>> What were ur estimatedDistanceCutoff, number of clusters 'k', projection search and how much memory did u have to allocate to the single Reducer?
> 
> If I read the source code correctly, the final reducer clusters the
> sketch which should contain m * k * log n intermediate centroids, where
> k is the number of desired clusters, m is the number of mappers run and
> n is the number of datapoints. Those centroids are expected to be dense,
> so we can estimate the memory required for the final reducer using this
> formula.
> 
>> 
>> 
>> 
>> 
>>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <ss...@apache.org> wrote:
>>> 
>>> That the algorithm runs a single reducer is expected. The algorithm
>>> creates a sketch of
>> the data in parallel in the map-phase, which is
>>> collected by the reducer afterwards. The reducer then applies an
>>> expensive in-memory clustering algorithm to the sketch.
>>> 
>>> Which dataset are you using for testing? I can also do some tests on a
>>> cluster here.
>>> 
>>> I can imagine two possible causes for the problems: Maybe there's a
>>> problem with the vectors and some calculations take very long because
>>> the wrong access pattern or implementation is chosen.
>>> 
>>> Another problem could be that the mappers and reducers have too few
>>> memory and spend a lot of time running garbage collections.
>>> 
>>> --sebastian
>>> 
>>> 
>>> On 23.12.2013 22:14,
>> Suneel Marthi wrote:
>>>> Has anyone be successful running Streaming KMeans clustering on a large
>>> dataset (> 100,000 points)?
>>>> 
>>>> 
>>>> It just seems to take a very long time (> 4hrs) for the mappers to
>>> finish on about 300K data points and the reduce phase has only a single
>>> reducer running and throws an OOM failing the job several hours after the
>>> job has been kicked off.
>>>> 
>>>> Its the same story when trying to run in sequential mode.
>>>> 
>>>> Looking at the code the bottleneck seems to be in
>>> StreamingKMeans.clusterInternal(), without understanding the behaviour of
>>> the algorithm I am not sure if the sequence of steps in there is correct.
>>>> 
>>>> 
>>>> There are few calls that call themselves repeatedly over and over again
>>> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
>>>> 
>>>> We really need to have this working on datasets that are larger than 20K
>>> reuters datasets.
>>>> 
>>>> I am trying to run this on 300K vectors with k= 100, km = 1261 and
>>> FastProjectSearch.
> 

Re: Streaming KMeans clustering

Posted by Sebastian Schelter <ss...@apache.org>.
On 25.12.2013 14:19, Suneel Marthi wrote:
> 
> 
> 
> 
> 
>>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <te...@gmail.com> wrote:
>  
>>> For reference, on a 16 core machine, I was able to run the sequential
>>> version of streaming k-means on 1,000,000 points, each with 10 dimensions
>>> in about 20 seconds.  The map-reduce versions are comparable subject to
>>> scaling except for startup time.
> 
> @Ted, were u working off the Streaming KMeans impl as in Mahout 0.8. Not sure how this would have even worked for u in sequential mode in light of the issues reported against M-1314, M-1358, M-1380 (all of which impact the sequential mode); unless u had fixed them locally.
> What were ur estimatedDistanceCutoff, number of clusters 'k', projection search and how much memory did u have to allocate to the single Reducer?

If I read the source code correctly, the final reducer clusters the
sketch which should contain m * k * log n intermediate centroids, where
k is the number of desired clusters, m is the number of mappers run and
n is the number of datapoints. Those centroids are expected to be dense,
so we can estimate the memory required for the final reducer using this
formula.

> 
> 
> 
> 
> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <ss...@apache.org> wrote:
> 
>> That the algorithm runs a single reducer is expected. The algorithm
>> creates a sketch of
>  the data in parallel in the map-phase, which is
>> collected by the reducer afterwards. The reducer then applies an
>> expensive in-memory clustering algorithm to the sketch.
>>
>> Which dataset are you using for testing? I can also do some tests on a
>> cluster here.
>>
>> I can imagine two possible causes for the problems: Maybe there's a
>> problem with the vectors and some calculations take very long because
>> the wrong access pattern or implementation is chosen.
>>
>> Another problem could be that the mappers and reducers have too few
>> memory and spend a lot of time running garbage collections.
>>
>> --sebastian
>>
>>
>> On 23.12.2013 22:14,
>  Suneel Marthi wrote:
>>> Has anyone be successful running Streaming KMeans clustering on a large
>> dataset (> 100,000 points)?
>>>
>>>
>>> It just seems to take a very long time (> 4hrs) for the mappers to
>> finish on about 300K data points and the reduce phase has only a single
>> reducer running and throws an OOM failing the job several hours after the
>> job has been kicked off.
>>>
>>> Its the same story when trying to run in sequential mode.
>>>
>>> Looking at the code the bottleneck seems to be in
>> StreamingKMeans.clusterInternal(), without understanding the behaviour of
>> the algorithm I am not sure if the sequence of steps in there is correct.
>>>
>>>
>>> There are few calls that call themselves repeatedly over and over again
>> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
>>>
>>> We really need to have this working on datasets that are larger than 20K
>> reuters datasets.
>>>
>>> I am trying to run this on 300K vectors with k= 100, km = 1261 and
>> FastProjectSearch.
>>>
>>
>>


Re: Streaming KMeans clustering

Posted by Suneel Marthi <su...@yahoo.com>.




>>On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <te...@gmail.com> wrote:
 
>>For reference, on a 16 core machine, I was able to run the sequential
>>version of streaming k-means on 1,000,000 points, each with 10 dimensions
>>in about 20 seconds.  The map-reduce versions are comparable subject to
>>scaling except for startup time.

@Ted, were you working off the Streaming KMeans impl as in Mahout 0.8? Not sure how this would have even worked for you in sequential mode in light of the issues reported against M-1314, M-1358, M-1380 (all of which impact the sequential mode), unless you had fixed them locally.
What were your estimatedDistanceCutoff, number of clusters 'k' and projection search, and how much memory did you have to allocate to the single Reducer?




On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <ss...@apache.org> wrote:

> That the algorithm runs a single reducer is expected. The algorithm
> creates a sketch of the data in parallel in the map-phase, which is
> collected by the reducer afterwards. The reducer then applies an
> expensive in-memory clustering algorithm to the sketch.
>
> Which dataset are you using for testing? I can also do some tests on a
> cluster here.
>
> I can imagine two possible causes for the problems: Maybe there's a
> problem with the vectors and some calculations take very long because
> the wrong access pattern or implementation is chosen.
>
> Another problem could be that the mappers and reducers have too few
> memory and spend a lot of time running garbage collections.
>
> --sebastian
>
>
> On 23.12.2013 22:14, Suneel Marthi wrote:
> > Has anyone be successful running Streaming KMeans clustering on a large
> dataset (> 100,000 points)?
> >
> >
> > It just seems to take a very long time (> 4hrs) for the mappers to
> finish on about 300K data points and the reduce phase has only a single
> reducer running and throws an OOM failing the job several hours after the
> job has been kicked off.
> >
> > Its the same story when trying to run in sequential mode.
> >
> > Looking at the code the bottleneck seems to be in
> StreamingKMeans.clusterInternal(), without understanding the behaviour of
> the algorithm I am not sure if the sequence of steps in there is correct.
> >
> >
> > There are few calls that call themselves repeatedly over and over again
> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
> >
> > We really need to have this working on datasets that are larger than 20K
> reuters datasets.
> >
> > I am trying to run this on 300K vectors with k= 100, km = 1261 and
> FastProjectSearch.
> >
>
>

Re: Streaming KMeans clustering

Posted by Ted Dunning <te...@gmail.com>.
For reference, on a 16 core machine, I was able to run the sequential
version of streaming k-means on 1,000,000 points, each with 10 dimensions
in about 20 seconds.  The map-reduce versions are comparable subject to
scaling except for startup time.


On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <ss...@apache.org> wrote:

> That the algorithm runs a single reducer is expected. The algorithm
> creates a sketch of the data in parallel in the map-phase, which is
> collected by the reducer afterwards. The reducer then applies an
> expensive in-memory clustering algorithm to the sketch.
>
> Which dataset are you using for testing? I can also do some tests on a
> cluster here.
>
> I can imagine two possible causes for the problems: Maybe there's a
> problem with the vectors and some calculations take very long because
> the wrong access pattern or implementation is chosen.
>
> Another problem could be that the mappers and reducers have too few
> memory and spend a lot of time running garbage collections.
>
> --sebastian
>
>
> On 23.12.2013 22:14, Suneel Marthi wrote:
> > Has anyone be successful running Streaming KMeans clustering on a large
> dataset (> 100,000 points)?
> >
> >
> > It just seems to take a very long time (> 4hrs) for the mappers to
> finish on about 300K data points and the reduce phase has only a single
> reducer running and throws an OOM failing the job several hours after the
> job has been kicked off.
> >
> > Its the same story when trying to run in sequential mode.
> >
> > Looking at the code the bottleneck seems to be in
> StreamingKMeans.clusterInternal(), without understanding the behaviour of
> the algorithm I am not sure if the sequence of steps in there is correct.
> >
> >
> > There are few calls that call themselves repeatedly over and over again
> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
> >
> > We really need to have this working on datasets that are larger than 20K
> reuters datasets.
> >
> > I am trying to run this on 300K vectors with k= 100, km = 1261 and
> FastProjectSearch.
> >
>
>

Re: Streaming KMeans clustering

Posted by Sebastian Schelter <ss...@apache.org>.
That the algorithm runs a single reducer is expected. The algorithm
creates a sketch of the data in parallel in the map-phase, which is
collected by the reducer afterwards. The reducer then applies an
expensive in-memory clustering algorithm to the sketch.

Which dataset are you using for testing? I can also do some tests on a
cluster here.

I can imagine two possible causes for the problems: Maybe there's a
problem with the vectors and some calculations take very long because
the wrong access pattern or implementation is chosen.

Another problem could be that the mappers and reducers have too few
memory and spend a lot of time running garbage collections.

--sebastian


On 23.12.2013 22:14, Suneel Marthi wrote:
> Has anyone be successful running Streaming KMeans clustering on a large dataset (> 100,000 points)?
> 
> 
> It just seems to take a very long time (> 4hrs) for the mappers to finish on about 300K data points and the reduce phase has only a single reducer running and throws an OOM failing the job several hours after the job has been kicked off.
> 
> Its the same story when trying to run in sequential mode.
> 
> Looking at the code the bottleneck seems to be in StreamingKMeans.clusterInternal(), without understanding the behaviour of the algorithm I am not sure if the sequence of steps in there is correct. 
> 
> 
> There are few calls that call themselves repeatedly over and over again like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
> 
> We really need to have this working on datasets that are larger than 20K reuters datasets.
> 
> I am trying to run this on 300K vectors with k= 100, km = 1261 and FastProjectSearch.
>