Posted to dev@mahout.apache.org by Paritosh Ranjan <pr...@xebia.com> on 2012/08/09 18:06:40 UTC

Re: KMeans job fails during 2nd iteration. Java Heap space

Ted,

Is the ball k-means related to the ball-tree-based implementation of
knn, or is it a different algorithm altogether?

On 09-08-2012 19:54, Ted Dunning wrote:
> Also, the knn package has a single pass k-means implementation that
> can easily handle 20,000 clusters or more.  This is done by using an
> approximate nearest neighbor algorithm inside the k-means
> implementation to decrease the time dependency on k to roughly log k.
>
> See http://github.com/tdunning/knn
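
A rough sketch of that idea, replacing the brute-force scan over all k centroids
with a one-dimensional random-projection search (illustrative only; the class and
method names below are hypothetical, not the knn package's actual API):

------------------------------------------------
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: index k centroids by their projection onto one random
// direction, then answer "nearest centroid?" with a binary search plus an exact
// check of a small window of candidates, instead of scanning all k centroids.
public class ProjectionSearch {

  private final double[][] centroids;  // k x d dense centroids
  private final double[] direction;    // random unit vector
  private final double[] keys;         // projections of centroids, sorted
  private final int[] order;           // centroid index per sorted position
  private final int window;            // candidates checked on each side (>= 1)

  public ProjectionSearch(double[][] centroids, int window, long seed) {
    this.centroids = centroids;
    this.window = window;
    int d = centroids[0].length;
    Random rng = new Random(seed);
    direction = new double[d];
    double norm = 0;
    for (int i = 0; i < d; i++) { direction[i] = rng.nextGaussian(); norm += direction[i] * direction[i]; }
    norm = Math.sqrt(norm);
    for (int i = 0; i < d; i++) { direction[i] /= norm; }

    double[] proj = new double[centroids.length];
    Integer[] idx = new Integer[centroids.length];
    for (int i = 0; i < centroids.length; i++) { proj[i] = dot(centroids[i], direction); idx[i] = i; }
    Arrays.sort(idx, (a, b) -> Double.compare(proj[a], proj[b]));
    order = new int[idx.length];
    keys = new double[idx.length];
    for (int i = 0; i < idx.length; i++) { order[i] = idx[i]; keys[i] = proj[idx[i]]; }
  }

  /** Returns the index of the (approximately) nearest centroid to the point. */
  public int nearestCentroid(double[] point) {
    int pos = Arrays.binarySearch(keys, dot(point, direction));
    if (pos < 0) { pos = -pos - 1; }
    int lo = Math.max(0, pos - window);
    int hi = Math.min(keys.length, pos + window);
    int best = order[lo];
    double bestDist = squaredDistance(point, centroids[best]);
    for (int i = lo + 1; i < hi; i++) {
      double dist = squaredDistance(point, centroids[order[i]]);
      if (dist < bestDist) { bestDist = dist; best = order[i]; }
    }
    return best;
  }

  private static double dot(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { s += a[i] * b[i]; }
    return s;
  }

  private static double squaredDistance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; s += diff * diff; }
    return s;
  }
}
------------------------------------------------

With the centroids indexed once per iteration, each point's assignment costs
roughly O(log k) plus a handful of exact distance computations rather than a
full pass over all k centroids.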
>
> Any help in testing these new capabilities or plumbing them into the
> standard Mahout capabilities would be very much appreciated.
>
> On Thu, Aug 9, 2012 at 7:05 AM, Ted Dunning <te...@gmail.com> wrote:
>> The upcoming knn package has a file based matrix implementation that uses memory mapping to allow sharing a copy of a large matrix between processes and threads.
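
A minimal illustration of the memory-mapping idea (not the knn package's actual
implementation): a dense row-major matrix of doubles stored in a plain file and
read through a mapped buffer, so tasks on the same machine share one copy via
the OS page cache.

------------------------------------------------
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch: a read-only dense matrix backed by a memory-mapped file.
// Layout assumed here: rows * cols doubles, row-major, no header.
public class MappedMatrix {

  private final int rows;
  private final int cols;
  private final DoubleBuffer data;

  public MappedMatrix(String path, int rows, int cols) throws IOException {
    this.rows = rows;
    this.cols = cols;
    try (RandomAccessFile file = new RandomAccessFile(path, "r");
         FileChannel channel = file.getChannel()) {
      // Note: a single mapping is limited to 2 GB, so a real implementation
      // would map a large file in several chunks.
      long bytes = (long) rows * cols * 8L;
      data = channel.map(FileChannel.MapMode.READ_ONLY, 0, bytes).asDoubleBuffer();
    }
  }

  public double get(int row, int col) {
    return data.get(row * cols + col);
  }

  public int numRows() { return rows; }
  public int numCols() { return cols; }
}
------------------------------------------------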
>>
>> Sent from my iPhone
>>
>> On Aug 9, 2012, at 1:48 AM, Abramov Pavel <p....@rambler-co.ru> wrote:
>>
>>> Hello,
>>>
>>> I think Zipf's law is relevant for my data. Thanks for the idea about
>>> memory-mapping.
>>>
>>> 1) How can I "drop" extremely small/large clusters? 50% of the clusters are
>>> small, with only 1 member, while 1 large cluster holds 50% of the members. Is
>>> there a way to "split" large clusters with KMeans?
>>>
>>> 2) Can I force Mahout to use the closest point from my set instead of the
>>> exact centroid? Each point has ~10 non-zero components, while the exact
>>> centroid is very dense (~200k components).
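
A minimal sketch of what question 2 asks for (replacing each dense centroid by
the nearest member point, i.e. a medoid-like representative); this is
illustrative code, not an existing Mahout option:

------------------------------------------------
// Illustrative only: snap each dense k-means centroid to the closest point
// that was assigned to it, so the cluster "center" stays as sparse as the data.
public class CentroidSnapper {

  /** Returns the member point closest to the given centroid. */
  public static double[] snapToNearestMember(double[] centroid, double[][] members) {
    double[] best = members[0];
    double bestDist = squaredDistance(centroid, best);
    for (int i = 1; i < members.length; i++) {
      double dist = squaredDistance(centroid, members[i]);
      if (dist < bestDist) { bestDist = dist; best = members[i]; }
    }
    return best;
  }

  private static double squaredDistance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; s += diff * diff; }
    return s;
  }
}
------------------------------------------------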
>>>
>>>
>>> Thanks!
>>>
>>> Pavel
>>>
>>>
>>> On 09.08.12 5:43, "Lance Norskog" <go...@gmail.com> wrote:
>>>
>>>> If Zipf's Law is relevant, locality will be much better than random.
>>>> Maybe you need a Vector implementation that is backed by memory-mapped
>>>> files?
>>>>
>>>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <p....@rambler-co.ru>
>>>> wrote:
>>>>> Thank you Jeff, Paritosh,
>>>>>
>>>>> Reducing cluster count from 1000 to 100 made my day. I will also try
>>>>> Canopy for initial cluster count.
>>>>> Unfortunately I don't know how to reduce my 200k dictionary. There are
>>>>> no infrequent terms.
>>>>>
>>>>> I will provide you with the Hadoop config shortly. But I am pretty sure I
>>>>> can't decrease the number of Mappers/Reducers per node or increase memory
>>>>> limits; it would affect the whole cluster.
>>>>>
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Pavel
>>>>>
>>>>>
>>>>> On 08.08.12 16:15, "Jeff Eastman" <jd...@windwardsolutions.com>
>>>>> wrote:
>>>>>
>>>>>> Consider that each cluster retains 4 vectors in memory in each mapper
>>>>>> and reducer, and that these vectors tend to become more dense (through
>>>>>> addition of multiple sparse components) as iterations proceed. With 1000
>>>>>> clusters and 200k terms in your dictionary this will cause the heap
>>>>>> space to be consumed rapidly, as you have noted. Sometimes you can work
>>>>>> around this problem by increasing your heap size on a per-job basis or
>>>>>> reducing the number of mappers and reducers allowed on each node. Also
>>>>>> be sure you are not launching reducers until all of your mapper tasks
>>>>>> have completed.
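
A back-of-envelope check of that growth, assuming the retained vectors densify
completely and are stored as 8-byte doubles:

  1,000 clusters x 4 vectors x 200,000 terms x 8 bytes ≈ 6.4 GB per task

which comfortably exceeds a typical mapper or reducer heap.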
>>>>>>
>>>>>> In order to provide more help, it would be useful to understand
>>>>>> more about how your cluster is "well tuned": how many mappers & reducers
>>>>>> you are launching in parallel, the heap-space limits set for tasks on
>>>>>> each node, etc.
>>>>>>
>>>>>> For a quick test, try reducing the k to 500 or 100 to see how many
>>>>>> clusters you can reasonably compute with your dataset on your cluster.
>>>>>> Canopy is also a good way to get a feel for a good initial k, though it
>>>>>> can be hard to arrive at good T-values in some text clustering cases.
>>>>>> You can also try hierarchical clustering with a reduced k to stay under
>>>>>> your memory limits.
>>>>>>
>>>>>>
>>>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>>>>> A stack trace of the error would have helped in finding the exact problem.
>>>>>>>
>>>>>>> However, the number of clusters can create heap space problems (if the
>>>>>>> vector dimension is also high).
>>>>>>> Either try to reduce the number of initial clusters (in my opinion, the
>>>>>>> best way to determine the initial clusters is Canopy Clustering:
>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering)
>>>>>>>
>>>>>>> or try to reduce the dimension of the vectors.
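
A compact sketch of the canopy idea behind that suggestion (one cheap pass with
thresholds T1 > T2; the number of canopy centers produced suggests an initial k).
This is illustrative single-machine code, not Mahout's CanopyDriver:

------------------------------------------------
import java.util.ArrayList;
import java.util.List;

// Illustrative canopy pass: every remaining point within T2 of a new canopy
// center is consumed. (In the full algorithm, points within the looser T1
// threshold would also be recorded as members of that canopy.)
public class CanopySketch {

  public static List<double[]> canopyCenters(List<double[]> points, double t2) {
    List<double[]> centers = new ArrayList<>();
    List<double[]> remaining = new ArrayList<>(points);
    while (!remaining.isEmpty()) {
      double[] center = remaining.remove(0);   // pick an arbitrary remaining point
      centers.add(center);
      remaining.removeIf(p -> distance(p, center) < t2);
    }
    return centers;
  }

  private static double distance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return Math.sqrt(s);
  }
}
------------------------------------------------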
>>>>>>>
>>>>>>> PS: you are also providing numClusters twice:
>>>>>>>
>>>>>>> --numClusters 1000 \ --numClusters 5 \
>>>>>>>
>>>>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I am trying to run the KMeans example on 15 000 000 documents (seq2sparse
>>>>>>>> output).
>>>>>>>> There are 1 000 clusters, a 200 000-term dictionary, and documents of 3-10
>>>>>>>> terms (titles). seq2sparse produces 200 files of 80 MB each.
>>>>>>>>
>>>>>>>> My job fails with a Java heap space error. The 1st iteration passes while
>>>>>>>> the 2nd iteration fails. In the Map phase of buildClusters I see a lot of
>>>>>>>> warnings, but it passes. The Reduce phase of buildClusters fails with
>>>>>>>> "Java heap space".
>>>>>>>>
>>>>>>>> I cannot increase reducer/mapper memory in Hadoop. My cluster is
>>>>>>>> tuned well.
>>>>>>>>
>>>>>>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>>>>>>> Reducers running on 40 servers, each with 8 cores and 12 GB RAM.
>>>>>>>>
>>>>>>>> Thanks in advance!
>>>>>>>>
>>>>>>>> Here are the KMeans parameters:
>>>>>>>>
>>>>>>>> ------------------------------------------------
>>>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>>>> -i ...tfidf-vectors/  \
>>>>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>>>>> --clusters /tmp/clusters/ \
>>>>>>>> --numClusters 1000 \
>>>>>>>> --numClusters 5 \
>>>>>>>> --overwrite \
>>>>>>>> --clustering
>>>>>>>> ------------------------------------------------
>>>>>>>>
>>>>>>>> Pavel
>>>>>>>
>>>>>>>
>>>>>>>
>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> goksron@gmail.com



Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
+1 I like your plan to separate these two concerns. How about creating 
two JIRAs so these conversations can be localized and retained?


On 8/10/12 6:33 AM, Paritosh Ranjan wrote:
> To me, it looks like a two-step implementation.
>
> a) Using KMeans++ for cluster initialization.
> b) Using ball KMeans for clustering
>
> Should we introduce the KMeans++ initialization algorithm in Mahout first
> and see how it behaves by observing user reactions?
> Once we are satisfied with the quality and scalability of KMeans++ 
> initialization, we can try to plug in the ball KMeans.
>
> This two-step process will also help in finding errors easily, if there
> are any.
>
> (I understand that putting in KMeans++ and ball KMeans together will
> provide the best performance and quality; however, I also think that
> it can create chaos if something goes wrong, which is why I am proposing
> to go step by step.)
>
> If you agree, then I can help by plugging in KMeans++ first.


Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Paritosh Ranjan <pr...@xebia.com>.
To me, it looks like a two-step implementation.

a) Using KMeans++ for cluster initialization.
b) Using ball KMeans for clustering

Should we introduce the KMeans++ initialization algorithm in Mahout first and
see how it behaves by observing user reactions?
Once we are satisfied with the quality and scalability of KMeans++ 
initialization, we can try to plug in the ball KMeans.

This two-step process will also help in finding errors easily, if there
are any.

(I understand that putting in KMeans++ and ball KMeans together will
provide the best performance and quality; however, I also think that it
can create chaos if something goes wrong, which is why I am proposing to
go step by step.)

If you agree, then I can help by plugging in KMeans++ first.

On 09-08-2012 22:39, Ted Dunning wrote:
> It is a different algorithm altogether.
>
> See http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf for reference.
>
> The basic idea underlying the algorithm is two-fold:
>
> - pick good starting points that have high likelihood of being in the
> core of each cluster
>
> - use points in the near neighborhood of the starting points to update
> the centroids
>
> With well-clusterable data, this gives a clustering algorithm that
> provably achieves high quality in a single step with high probability.
> With less well-clusterable data, the situation is less clear, but a
> few iterations typically produce very good results.
>
> The initialization algorithm is similar to k-means++.  See wikipedia
> for more about that.



Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Ted Dunning <te...@gmail.com>.
It is a different algorithm altogether.

See http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf for reference.

The basic idea underlying the algorithm is two-fold:

- pick good starting points that have high likelihood of being in the
core of each cluster

- use points in the near neighborhood of the starting points to update
the centroids

With well-clusterable data, this gives a clustering algorithm that
provably achieves high quality in a single step with high probability.
With less well-clusterable data, the situation is less clear, but a
few iterations typically produce very good results.
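
A sketch of the second point above: recompute each centroid from only those
points that fall well inside its "ball", here taken as points whose distance to
their closest centroid is at most some fraction of the distance to the
second-closest centroid. The radius rule is an assumption for illustration, not
necessarily the one used in the paper or the knn package.

------------------------------------------------
// Illustrative ball update: points near cluster boundaries are ignored when
// recomputing centroids, so outliers and contested points do not drag them.
public class BallUpdate {

  public static double[][] update(double[][] points, double[][] centroids, double ballFraction) {
    int k = centroids.length;
    int d = centroids[0].length;
    double[][] sums = new double[k][d];
    int[] counts = new int[k];

    for (double[] p : points) {
      // Squared distances to the closest and second-closest centroids.
      int best = -1;
      double bestDist = Double.MAX_VALUE, secondDist = Double.MAX_VALUE;
      for (int c = 0; c < k; c++) {
        double dist = squaredDistance(p, centroids[c]);
        if (dist < bestDist) { secondDist = bestDist; bestDist = dist; best = c; }
        else if (dist < secondDist) { secondDist = dist; }
      }
      // Keep the point only if it is well inside its cluster's ball.
      if (bestDist <= ballFraction * ballFraction * secondDist) {
        for (int i = 0; i < d; i++) { sums[best][i] += p[i]; }
        counts[best]++;
      }
    }

    double[][] updated = new double[k][d];
    for (int c = 0; c < k; c++) {
      for (int i = 0; i < d; i++) {
        updated[c][i] = counts[c] > 0 ? sums[c][i] / counts[c] : centroids[c][i];
      }
    }
    return updated;
  }

  private static double squaredDistance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; s += diff * diff; }
    return s;
  }
}
------------------------------------------------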

The initialization algorithm is similar to k-means++.  See wikipedia
for more about that.
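
For reference, a compact sketch of k-means++-style seeding (first center chosen
uniformly at random, each further center sampled with probability proportional
to the squared distance to its nearest already-chosen center); illustrative
in-memory code, not the initialization actually used in the knn package:

------------------------------------------------
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative k-means++ seeding over an in-memory set of dense points.
public class KMeansPlusPlusSeeding {

  public static List<double[]> chooseSeeds(double[][] points, int k, long seed) {
    Random rng = new Random(seed);
    List<double[]> centers = new ArrayList<>();

    // First center: a uniformly random point.
    centers.add(points[rng.nextInt(points.length)]);

    // dist[i] = squared distance from points[i] to its closest chosen center.
    double[] dist = new double[points.length];
    for (int i = 0; i < points.length; i++) {
      dist[i] = squaredDistance(points[i], centers.get(0));
    }

    while (centers.size() < k) {
      // Sample the next center with probability proportional to dist[i].
      double total = 0;
      for (double d : dist) { total += d; }
      double target = rng.nextDouble() * total;
      int chosen = 0;
      for (int i = 0; i < dist.length; i++) {
        target -= dist[i];
        if (target <= 0) { chosen = i; break; }
      }
      double[] center = points[chosen];
      centers.add(center);

      // Update each point's distance to its nearest chosen center.
      for (int i = 0; i < points.length; i++) {
        dist[i] = Math.min(dist[i], squaredDistance(points[i], center));
      }
    }
    return centers;
  }

  private static double squaredDistance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; s += diff * diff; }
    return s;
  }
}
------------------------------------------------

On a corpus the size discussed in this thread, such seeding would be run on a
sample or reworked as a distributed pass rather than over all points at once.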

On Thu, Aug 9, 2012 at 9:06 AM, Paritosh Ranjan <pr...@xebia.com> wrote:
> Ted,
>
> Is the ball k-means related to the ball-tree-based implementation of knn, or
> is it a different algorithm altogether?