Posted to user@mahout.apache.org by Abramov Pavel <p....@rambler-co.ru> on 2012/08/08 07:12:35 UTC

KMeans job fails during 2nd iteration. Java Heap space

Hello, 

I am trying to run the KMeans example on 15 000 000 documents (seq2sparse output).
There are 1 000 clusters, a 200 000-term dictionary, and documents of 3-10 terms (titles). seq2sparse produces 200 files of 80 MB each.

My job fails with a Java heap space error: the 1st iteration passes while the 2nd iteration fails. During the Map phase of buildClusters I see a lot of warnings, but it passes. The Reduce phase of buildClusters fails with "Java heap space".

I cannot increase reducer/mapper memory in Hadoop; my cluster is already well tuned.

How can I avoid this situation? My cluster runs 300 Mappers and 220 Reducers on 40 servers, each with 8 cores and 12 GB of RAM.

Thanks in advance!

Here are the KMeans parameters:

------------------------------------------------
mahout kmeans -Dmapred.reduce.tasks=200 \
-i ...tfidf-vectors/  \
-o /tmp/clustering_results_kmeans/ \
--clusters /tmp/clusters/ \
--numClusters 1000 \
--numClusters 5 \
--overwrite \
--clustering
------------------------------------------------

Pavel

Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
+1 I like your plan to separate these two concerns. How about creating 
two JIRAs so these conversations can be localized and retained?


On 8/10/12 6:33 AM, Paritosh Ranjan wrote:
> To me, it looks like a two step implementation.
>
> a) Using KMeans++ for cluster initialization.
> b) Using ball KMeans for clustering
>
> Should we introduce KMeans++ intialization algorithm in mahout first 
> and see how it behaves by observing user reactions?
> Once we are satisfied with the quality and scalability of KMeans++ 
> initialization, we can try to plug in the ball KMeans.
>
> This two step process will also help in finding errors easily if there 
> would be any.
>
> ( I understand that putting in KMeans++ and Ball KMeans together will 
> provide the best performance and quality, however, I also think that 
> it can create chaos if something goes wrong, that's why I am proposing 
> to go step by step ).
>
> If you agree, then I can help by plugging in KMeans++ first.
>
> On 09-08-2012 22:39, Ted Dunning wrote:
>> It is a different algorithm altogether.
>>
>> See http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf for 
>> reference.
>>
>> The basic idea underlying the algorithm is two-fold:
>>
>> - pick good starting points that have high likelihood of being in the
>> core of each cluster
>>
>> - use points in the near neighborhood of the starting points to update
>> the centroids
>>
>> With well-clusterable data, this gives a clustering algorithm that
>> provably achieves high quality in a single step with high probability.
>>   With less well-clusterable data, the situation is less clear, but a
>> few iterations typically produces very good results.
>>
>> The initialization algorithm is similar to k-means++.  See wikipedia
>> for more about that.
>>
>> On Thu, Aug 9, 2012 at 9:06 AM, Paritosh Ranjan <pr...@xebia.com> 
>> wrote:
>>> Ted,
>>>
>>> Is the ball k-means related to the ball tree based implementation of 
>>> knn, or
>>> its a different algorithm altogether?
>>>
>>> On 09-08-2012 19:54, Ted Dunning wrote:
>>>> Also, the knn package has a single pass k-means implementation that
>>>> can easily handle 20,000 clusters or more.  This is done by using an
>>>> approximate nearest neighbor algorithm inside the k-means
>>>> implementation to decrease the time dependency on k to roughly log k.
>>>>
>>>> See http://github.com/tdunning/knn
>>>>
>>>> Any help in testing these new capabilities or plumbing them into the
>>>> standard Mahout capabilities would be very much appreciated.
>>>>
>>>> On Thu, Aug 9, 2012 at 7:05 AM, Ted Dunning <te...@gmail.com> 
>>>> wrote:
>>>>> The upcoming knn package has a file based matrix implementation 
>>>>> that uses
>>>>> memory mapping to allow sharing a copy of a large matrix between 
>>>>> processes
>>>>> and threads.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Aug 9, 2012, at 1:48 AM, Abramov Pavel <p....@rambler-co.ru>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> If think Zipf's law is relevant for my data. Thanks for idea about
>>>>>> memory-mapping.
>>>>>>
>>>>>> 1) How can I "drop" extremely small/large clusters? There are 50% 
>>>>>> small
>>>>>> clusters with only 1 member while 1 large cluster has 50% 
>>>>>> members. Is
>>>>>> there a way to "split" large clusters with Kmeans?
>>>>>>
>>>>>> 2) Can I force Mahout not to use exact centroid but the closest 
>>>>>> point
>>>>>> from
>>>>>> my set? Any point has ~10 non-zero components while exact 
>>>>>> centroid is
>>>>>> very
>>>>>> dense (~200k).
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Pavel
>>>>>>
>>>>>>
>>>>>> 09.08.12 5:43 пользователь "Lance Norskog" <go...@gmail.com> 
>>>>>> написал:
>>>>>>
>>>>>>> If Zipf's Law is relevant, locality will be much better than 
>>>>>>> random.
>>>>>>> Maybe you need a Vector implementation that is backed by 
>>>>>>> memory-mapped
>>>>>>> files?
>>>>>>>
>>>>>>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel
>>>>>>> <p....@rambler-co.ru>
>>>>>>> wrote:
>>>>>>>> Thank you Jeff, Paritosh,
>>>>>>>>
>>>>>>>> Reducing cluster count from 1000 to 100 made my day. I will 
>>>>>>>> also try
>>>>>>>> Canopy for initial cluster count.
>>>>>>>> Unfortunately I don't know how to reduce my 200k dictionary. 
>>>>>>>> There are
>>>>>>>> no
>>>>>>>> unfrequent terms.
>>>>>>>>
>>>>>>>> I will provide you with Hadoop config shortly. But I am pretty 
>>>>>>>> sure I
>>>>>>>> can't decrease number of Mappers/Reducers per node or increase 
>>>>>>>> memory
>>>>>>>> limits. It will affect the whole cluster.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>>
>>>>>>>> Pavel
>>>>>>>>
>>>>>>>>
>>>>>>>> 08.08.12 16:15 пользователь "Jeff Eastman"
>>>>>>>> <jd...@windwardsolutions.com>
>>>>>>>> написал:
>>>>>>>>
>>>>>>>>> Consider that each cluster retains 4 vectors in memory in each 
>>>>>>>>> mapper
>>>>>>>>> and reducer, and that these vectors tend to become more dense
>>>>>>>>> (through
>>>>>>>>> addition of multiple sparse components) as iterations proceed. 
>>>>>>>>> With
>>>>>>>>> 1000
>>>>>>>>> clusters and 200k terms in your dictionary this will cause the 
>>>>>>>>> heap
>>>>>>>>> space to be consumed rapidly as you have noted. Some times you 
>>>>>>>>> can
>>>>>>>>> work
>>>>>>>>> around this problem by increasing your heap size on a per-job 
>>>>>>>>> basis
>>>>>>>>> or
>>>>>>>>> reducing the number of mappers and reducers allowed on each node.
>>>>>>>>> Also
>>>>>>>>> be sure you are not launching reducers until all of your 
>>>>>>>>> mapper tasks
>>>>>>>>> have completed.
>>>>>>>>>
>>>>>>>>> In order to provide more help to you, it would be useful to
>>>>>>>>> understand
>>>>>>>>> more about how your cluster is "well tuned". How many mappers &
>>>>>>>>> reducers
>>>>>>>>> are you launching in parallel, the heapspace limits set for 
>>>>>>>>> tasks on
>>>>>>>>> each node, etc.
>>>>>>>>>
>>>>>>>>> For a quick test, try reducing the k to 500 or 100 to see how 
>>>>>>>>> many
>>>>>>>>> clusters you can reasonably compute with your dataset on your
>>>>>>>>> cluster.
>>>>>>>>> Canopy is also a good way to get a feel for a good initial k, 
>>>>>>>>> though
>>>>>>>>> it
>>>>>>>>> can be hard to arrive at good T-values in some text clustering 
>>>>>>>>> cases.
>>>>>>>>> You, can also try hierarchical clustering with reduced k to stay
>>>>>>>>> under
>>>>>>>>> your memory limits.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>>>>>>>> A stacktrace of error would have helped in finding the exact 
>>>>>>>>>> error.
>>>>>>>>>>
>>>>>>>>>> However, number of clusters can create Heap Space problems ( 
>>>>>>>>>> If the
>>>>>>>>>> vector dimension is also high ).
>>>>>>>>>> Either try to reduce the number of initial clusters ( In my 
>>>>>>>>>> opinion,
>>>>>>>>>> the best way to know about initial clusters is Canopy Clustering
>>>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering 
>>>>>>>>>>
>>>>>>>>>> )
>>>>>>>>>>
>>>>>>>>>> or, try to reduce the dimension of the vectors.
>>>>>>>>>>
>>>>>>>>>> PS : you are also providing numClusters twice
>>>>>>>>>>
>>>>>>>>>> --numClusters 1000 \ --numClusters 5 \
>>>>>>>>>>
>>>>>>>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I am trying to run KMeans example on 15 000 000 documents
>>>>>>>>>>> (seq2sparse
>>>>>>>>>>> output).
>>>>>>>>>>> There are 1 000 clusters, 200 000 terms dictionary and 3-10 
>>>>>>>>>>> terms
>>>>>>>>>>> document size (titles). seq2sparse produces 200 files 80 MB 
>>>>>>>>>>> each.
>>>>>>>>>>>
>>>>>>>>>>> My job failed with Java heap space Error. 1st iteration passes
>>>>>>>>>>> while
>>>>>>>>>>> 2nd iteration fails. On a Map phase of buildClusters I see a 
>>>>>>>>>>> lot of
>>>>>>>>>>> warnings, but it passes. Reduce phase of buildClusters fails 
>>>>>>>>>>> with
>>>>>>>>>>> "Java Heap space".
>>>>>>>>>>>
>>>>>>>>>>> I can not increase reducer/mapper memory in hadoop. My 
>>>>>>>>>>> cluster is
>>>>>>>>>>> tunned well.
>>>>>>>>>>>
>>>>>>>>>>> How can I avoid this situation? My cluster has 300 Mappers 
>>>>>>>>>>> and 220
>>>>>>>>>>> Reducers running 40 servers 8-core 12 GB RAM.
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>
>>>>>>>>>>> Here is KMeans parameters:
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>>>>>>> -i ...tfidf-vectors/  \
>>>>>>>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>>>>>>>> --clusters /tmp/clusters/ \
>>>>>>>>>>> --numClusters 1000 \
>>>>>>>>>>> --numClusters 5 \
>>>>>>>>>>> --overwrite \
>>>>>>>>>>> --clustering
>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> Pavel
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>> Lance Norskog
>>>>>>> goksron@gmail.com
>>>
>>>
>
>
>
>


Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Paritosh Ranjan <pr...@xebia.com>.
To me, it looks like a two-step implementation:

a) Using KMeans++ for cluster initialization.
b) Using ball KMeans for clustering.

Should we introduce the KMeans++ initialization algorithm in Mahout first 
and see how it behaves by observing user reactions?
Once we are satisfied with the quality and scalability of KMeans++ 
initialization, we can try to plug in the ball KMeans.

This two-step process will also help in finding errors easily, if there 
are any.

(I understand that putting in KMeans++ and ball KMeans together will 
provide the best performance and quality; however, I also think that it 
can create chaos if something goes wrong, which is why I am proposing to 
go step by step.)

If you agree, then I can help by plugging in KMeans++ first.

On 09-08-2012 22:39, Ted Dunning wrote:
> It is a different algorithm altogether.
>
> See http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf for reference.
>
> The basic idea underlying the algorithm is two-fold:
>
> - pick good starting points that have high likelihood of being in the
> core of each cluster
>
> - use points in the near neighborhood of the starting points to update
> the centroids
>
> With well-clusterable data, this gives a clustering algorithm that
> provably achieves high quality in a single step with high probability.
>   With less well-clusterable data, the situation is less clear, but a
> few iterations typically produces very good results.
>
> The initialization algorithm is similar to k-means++.  See wikipedia
> for more about that.
>
> On Thu, Aug 9, 2012 at 9:06 AM, Paritosh Ranjan <pr...@xebia.com> wrote:
>> Ted,
>>
>> Is the ball k-means related to the ball tree based implementation of knn, or
>> its a different algorithm altogether?
>>
>> On 09-08-2012 19:54, Ted Dunning wrote:
>>> Also, the knn package has a single pass k-means implementation that
>>> can easily handle 20,000 clusters or more.  This is done by using an
>>> approximate nearest neighbor algorithm inside the k-means
>>> implementation to decrease the time dependency on k to roughly log k.
>>>
>>> See http://github.com/tdunning/knn
>>>
>>> Any help in testing these new capabilities or plumbing them into the
>>> standard Mahout capabilities would be very much appreciated.
>>>
>>> On Thu, Aug 9, 2012 at 7:05 AM, Ted Dunning <te...@gmail.com> wrote:
>>>> The upcoming knn package has a file based matrix implementation that uses
>>>> memory mapping to allow sharing a copy of a large matrix between processes
>>>> and threads.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Aug 9, 2012, at 1:48 AM, Abramov Pavel <p....@rambler-co.ru>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> If think Zipf's law is relevant for my data. Thanks for idea about
>>>>> memory-mapping.
>>>>>
>>>>> 1) How can I "drop" extremely small/large clusters? There are 50% small
>>>>> clusters with only 1 member while 1 large cluster has 50% members. Is
>>>>> there a way to "split" large clusters with Kmeans?
>>>>>
>>>>> 2) Can I force Mahout not to use exact centroid but the closest point
>>>>> from
>>>>> my set? Any point has ~10 non-zero components while exact centroid is
>>>>> very
>>>>> dense (~200k).
>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Pavel
>>>>>
>>>>>
>>>>> 09.08.12 5:43 пользователь "Lance Norskog" <go...@gmail.com> написал:
>>>>>
>>>>>> If Zipf's Law is relevant, locality will be much better than random.
>>>>>> Maybe you need a Vector implementation that is backed by memory-mapped
>>>>>> files?
>>>>>>
>>>>>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel
>>>>>> <p....@rambler-co.ru>
>>>>>> wrote:
>>>>>>> Thank you Jeff, Paritosh,
>>>>>>>
>>>>>>> Reducing cluster count from 1000 to 100 made my day. I will also try
>>>>>>> Canopy for initial cluster count.
>>>>>>> Unfortunately I don't know how to reduce my 200k dictionary. There are
>>>>>>> no
>>>>>>> unfrequent terms.
>>>>>>>
>>>>>>> I will provide you with Hadoop config shortly. But I am pretty sure I
>>>>>>> can't decrease number of Mappers/Reducers per node or increase memory
>>>>>>> limits. It will affect the whole cluster.
>>>>>>>
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>> Pavel
>>>>>>>
>>>>>>>
>>>>>>> 08.08.12 16:15 пользователь "Jeff Eastman"
>>>>>>> <jd...@windwardsolutions.com>
>>>>>>> написал:
>>>>>>>
>>>>>>>> Consider that each cluster retains 4 vectors in memory in each mapper
>>>>>>>> and reducer, and that these vectors tend to become more dense
>>>>>>>> (through
>>>>>>>> addition of multiple sparse components) as iterations proceed. With
>>>>>>>> 1000
>>>>>>>> clusters and 200k terms in your dictionary this will cause the heap
>>>>>>>> space to be consumed rapidly as you have noted. Some times you can
>>>>>>>> work
>>>>>>>> around this problem by increasing your heap size on a per-job basis
>>>>>>>> or
>>>>>>>> reducing the number of mappers and reducers allowed on each node.
>>>>>>>> Also
>>>>>>>> be sure you are not launching reducers until all of your mapper tasks
>>>>>>>> have completed.
>>>>>>>>
>>>>>>>> In order to provide more help to you, it would be useful to
>>>>>>>> understand
>>>>>>>> more about how your cluster is "well tuned". How many mappers &
>>>>>>>> reducers
>>>>>>>> are you launching in parallel, the heapspace limits set for tasks on
>>>>>>>> each node, etc.
>>>>>>>>
>>>>>>>> For a quick test, try reducing the k to 500 or 100 to see how many
>>>>>>>> clusters you can reasonably compute with your dataset on your
>>>>>>>> cluster.
>>>>>>>> Canopy is also a good way to get a feel for a good initial k, though
>>>>>>>> it
>>>>>>>> can be hard to arrive at good T-values in some text clustering cases.
>>>>>>>> You, can also try hierarchical clustering with reduced k to stay
>>>>>>>> under
>>>>>>>> your memory limits.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>>>>>>> A stacktrace of error would have helped in finding the exact error.
>>>>>>>>>
>>>>>>>>> However, number of clusters can create Heap Space problems ( If the
>>>>>>>>> vector dimension is also high ).
>>>>>>>>> Either try to reduce the number of initial clusters ( In my opinion,
>>>>>>>>> the best way to know about initial clusters is Canopy Clustering
>>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
>>>>>>>>> )
>>>>>>>>>
>>>>>>>>> or, try to reduce the dimension of the vectors.
>>>>>>>>>
>>>>>>>>> PS : you are also providing numClusters twice
>>>>>>>>>
>>>>>>>>> --numClusters 1000 \ --numClusters 5 \
>>>>>>>>>
>>>>>>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I am trying to run KMeans example on 15 000 000 documents
>>>>>>>>>> (seq2sparse
>>>>>>>>>> output).
>>>>>>>>>> There are 1 000 clusters, 200 000 terms dictionary and 3-10 terms
>>>>>>>>>> document size (titles). seq2sparse produces 200 files 80 MB each.
>>>>>>>>>>
>>>>>>>>>> My job failed with Java heap space Error. 1st iteration passes
>>>>>>>>>> while
>>>>>>>>>> 2nd iteration fails. On a Map phase of buildClusters I see a lot of
>>>>>>>>>> warnings, but it passes. Reduce phase of buildClusters fails with
>>>>>>>>>> "Java Heap space".
>>>>>>>>>>
>>>>>>>>>> I can not increase reducer/mapper memory in hadoop. My cluster is
>>>>>>>>>> tunned well.
>>>>>>>>>>
>>>>>>>>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>>>>>>>>> Reducers running 40 servers 8-core 12 GB RAM.
>>>>>>>>>>
>>>>>>>>>> Thanks in advance!
>>>>>>>>>>
>>>>>>>>>> Here is KMeans parameters:
>>>>>>>>>>
>>>>>>>>>> ------------------------------------------------
>>>>>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>>>>>> -i ...tfidf-vectors/  \
>>>>>>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>>>>>>> --clusters /tmp/clusters/ \
>>>>>>>>>> --numClusters 1000 \
>>>>>>>>>> --numClusters 5 \
>>>>>>>>>> --overwrite \
>>>>>>>>>> --clustering
>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> Pavel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Lance Norskog
>>>>>> goksron@gmail.com
>>
>>



Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Ted Dunning <te...@gmail.com>.
It is a different algorithm altogether.

See http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf for reference.

The basic idea underlying the algorithm is two-fold:

- pick good starting points that have high likelihood of being in the
core of each cluster

- use points in the near neighborhood of the starting points to update
the centroids

With well-clusterable data, this gives a clustering algorithm that
provably achieves high quality in a single step with high probability.
With less well-clusterable data, the situation is less clear, but a
few iterations typically produce very good results.

The initialization algorithm is similar to k-means++.  See wikipedia
for more about that.
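
To make the seeding idea concrete, here is a minimal single-machine sketch of
k-means++-style (D-squared) seeding in plain Java. It only illustrates picking
well-spread starting points; the class name and dense double[] layout are
invented for the example, and the actual knn/ball k-means code is far more
general than this.

------------------------------------------------
import java.util.Arrays;
import java.util.Random;

public class KMeansPlusPlusSeeding {

  // Squared Euclidean distance between two dense points.
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Picks k seeds: the first uniformly at random, each further seed with
  // probability proportional to its squared distance from the nearest
  // already-chosen seed.
  static double[][] seed(double[][] points, int k, Random rng) {
    double[][] centers = new double[k][];
    centers[0] = points[rng.nextInt(points.length)];
    double[] d2 = new double[points.length];
    Arrays.fill(d2, Double.POSITIVE_INFINITY);
    for (int c = 1; c < k; c++) {
      double total = 0.0;
      for (int i = 0; i < points.length; i++) {
        d2[i] = Math.min(d2[i], squaredDistance(points[i], centers[c - 1]));
        total += d2[i];
      }
      double target = rng.nextDouble() * total;
      int chosen = points.length - 1;              // fallback to the last point
      double cumulative = 0.0;
      for (int i = 0; i < points.length; i++) {
        cumulative += d2[i];
        if (cumulative >= target) { chosen = i; break; }
      }
      centers[c] = points[chosen];
    }
    return centers;
  }
}
------------------------------------------------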

On Thu, Aug 9, 2012 at 9:06 AM, Paritosh Ranjan <pr...@xebia.com> wrote:
> Ted,
>
> Is the ball k-means related to the ball tree based implementation of knn, or
> its a different algorithm altogether?
>
> On 09-08-2012 19:54, Ted Dunning wrote:
>>
>> Also, the knn package has a single pass k-means implementation that
>> can easily handle 20,000 clusters or more.  This is done by using an
>> approximate nearest neighbor algorithm inside the k-means
>> implementation to decrease the time dependency on k to roughly log k.
>>
>> See http://github.com/tdunning/knn
>>
>> Any help in testing these new capabilities or plumbing them into the
>> standard Mahout capabilities would be very much appreciated.
>>
>> On Thu, Aug 9, 2012 at 7:05 AM, Ted Dunning <te...@gmail.com> wrote:
>>>
>>> The upcoming knn package has a file based matrix implementation that uses
>>> memory mapping to allow sharing a copy of a large matrix between processes
>>> and threads.
>>>
>>> Sent from my iPhone
>>>
>>> On Aug 9, 2012, at 1:48 AM, Abramov Pavel <p....@rambler-co.ru>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> If think Zipf's law is relevant for my data. Thanks for idea about
>>>> memory-mapping.
>>>>
>>>> 1) How can I "drop" extremely small/large clusters? There are 50% small
>>>> clusters with only 1 member while 1 large cluster has 50% members. Is
>>>> there a way to "split" large clusters with Kmeans?
>>>>
>>>> 2) Can I force Mahout not to use exact centroid but the closest point
>>>> from
>>>> my set? Any point has ~10 non-zero components while exact centroid is
>>>> very
>>>> dense (~200k).
>>>>
>>>>
>>>> Thanks!
>>>>
>>>> Pavel
>>>>
>>>>
>>>> 09.08.12 5:43 пользователь "Lance Norskog" <go...@gmail.com> написал:
>>>>
>>>>> If Zipf's Law is relevant, locality will be much better than random.
>>>>> Maybe you need a Vector implementation that is backed by memory-mapped
>>>>> files?
>>>>>
>>>>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel
>>>>> <p....@rambler-co.ru>
>>>>> wrote:
>>>>>>
>>>>>> Thank you Jeff, Paritosh,
>>>>>>
>>>>>> Reducing cluster count from 1000 to 100 made my day. I will also try
>>>>>> Canopy for initial cluster count.
>>>>>> Unfortunately I don't know how to reduce my 200k dictionary. There are
>>>>>> no
>>>>>> unfrequent terms.
>>>>>>
>>>>>> I will provide you with Hadoop config shortly. But I am pretty sure I
>>>>>> can't decrease number of Mappers/Reducers per node or increase memory
>>>>>> limits. It will affect the whole cluster.
>>>>>>
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Pavel
>>>>>>
>>>>>>
>>>>>> 08.08.12 16:15 пользователь "Jeff Eastman"
>>>>>> <jd...@windwardsolutions.com>
>>>>>> написал:
>>>>>>
>>>>>>> Consider that each cluster retains 4 vectors in memory in each mapper
>>>>>>> and reducer, and that these vectors tend to become more dense
>>>>>>> (through
>>>>>>> addition of multiple sparse components) as iterations proceed. With
>>>>>>> 1000
>>>>>>> clusters and 200k terms in your dictionary this will cause the heap
>>>>>>> space to be consumed rapidly as you have noted. Some times you can
>>>>>>> work
>>>>>>> around this problem by increasing your heap size on a per-job basis
>>>>>>> or
>>>>>>> reducing the number of mappers and reducers allowed on each node.
>>>>>>> Also
>>>>>>> be sure you are not launching reducers until all of your mapper tasks
>>>>>>> have completed.
>>>>>>>
>>>>>>> In order to provide more help to you, it would be useful to
>>>>>>> understand
>>>>>>> more about how your cluster is "well tuned". How many mappers &
>>>>>>> reducers
>>>>>>> are you launching in parallel, the heapspace limits set for tasks on
>>>>>>> each node, etc.
>>>>>>>
>>>>>>> For a quick test, try reducing the k to 500 or 100 to see how many
>>>>>>> clusters you can reasonably compute with your dataset on your
>>>>>>> cluster.
>>>>>>> Canopy is also a good way to get a feel for a good initial k, though
>>>>>>> it
>>>>>>> can be hard to arrive at good T-values in some text clustering cases.
>>>>>>> You, can also try hierarchical clustering with reduced k to stay
>>>>>>> under
>>>>>>> your memory limits.
>>>>>>>
>>>>>>>
>>>>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>>>>>>
>>>>>>>> A stacktrace of error would have helped in finding the exact error.
>>>>>>>>
>>>>>>>> However, number of clusters can create Heap Space problems ( If the
>>>>>>>> vector dimension is also high ).
>>>>>>>> Either try to reduce the number of initial clusters ( In my opinion,
>>>>>>>> the best way to know about initial clusters is Canopy Clustering
>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
>>>>>>>> )
>>>>>>>>
>>>>>>>> or, try to reduce the dimension of the vectors.
>>>>>>>>
>>>>>>>> PS : you are also providing numClusters twice
>>>>>>>>
>>>>>>>> --numClusters 1000 \ --numClusters 5 \
>>>>>>>>
>>>>>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I am trying to run KMeans example on 15 000 000 documents
>>>>>>>>> (seq2sparse
>>>>>>>>> output).
>>>>>>>>> There are 1 000 clusters, 200 000 terms dictionary and 3-10 terms
>>>>>>>>> document size (titles). seq2sparse produces 200 files 80 MB each.
>>>>>>>>>
>>>>>>>>> My job failed with Java heap space Error. 1st iteration passes
>>>>>>>>> while
>>>>>>>>> 2nd iteration fails. On a Map phase of buildClusters I see a lot of
>>>>>>>>> warnings, but it passes. Reduce phase of buildClusters fails with
>>>>>>>>> "Java Heap space".
>>>>>>>>>
>>>>>>>>> I can not increase reducer/mapper memory in hadoop. My cluster is
>>>>>>>>> tunned well.
>>>>>>>>>
>>>>>>>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>>>>>>>> Reducers running 40 servers 8-core 12 GB RAM.
>>>>>>>>>
>>>>>>>>> Thanks in advance!
>>>>>>>>>
>>>>>>>>> Here is KMeans parameters:
>>>>>>>>>
>>>>>>>>> ------------------------------------------------
>>>>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>>>>> -i ...tfidf-vectors/  \
>>>>>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>>>>>> --clusters /tmp/clusters/ \
>>>>>>>>> --numClusters 1000 \
>>>>>>>>> --numClusters 5 \
>>>>>>>>> --overwrite \
>>>>>>>>> --clustering
>>>>>>>>> ------------------------------------------------
>>>>>>>>>
>>>>>>>>> Pavel
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Lance Norskog
>>>>> goksron@gmail.com
>
>
>

Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Paritosh Ranjan <pr...@xebia.com>.
Ted,

Is the ball k-means related to the ball-tree-based implementation of 
knn, or is it a different algorithm altogether?

On 09-08-2012 19:54, Ted Dunning wrote:
> Also, the knn package has a single pass k-means implementation that
> can easily handle 20,000 clusters or more.  This is done by using an
> approximate nearest neighbor algorithm inside the k-means
> implementation to decrease the time dependency on k to roughly log k.
>
> See http://github.com/tdunning/knn
>
> Any help in testing these new capabilities or plumbing them into the
> standard Mahout capabilities would be very much appreciated.
>
> On Thu, Aug 9, 2012 at 7:05 AM, Ted Dunning <te...@gmail.com> wrote:
>> The upcoming knn package has a file based matrix implementation that uses memory mapping to allow sharing a copy of a large matrix between processes and threads.
>>
>> Sent from my iPhone
>>
>> On Aug 9, 2012, at 1:48 AM, Abramov Pavel <p....@rambler-co.ru> wrote:
>>
>>> Hello,
>>>
>>> If think Zipf's law is relevant for my data. Thanks for idea about
>>> memory-mapping.
>>>
>>> 1) How can I "drop" extremely small/large clusters? There are 50% small
>>> clusters with only 1 member while 1 large cluster has 50% members. Is
>>> there a way to "split" large clusters with Kmeans?
>>>
>>> 2) Can I force Mahout not to use exact centroid but the closest point from
>>> my set? Any point has ~10 non-zero components while exact centroid is very
>>> dense (~200k).
>>>
>>>
>>> Thanks!
>>>
>>> Pavel
>>>
>>>
>>> 09.08.12 5:43 пользователь "Lance Norskog" <go...@gmail.com> написал:
>>>
>>>> If Zipf's Law is relevant, locality will be much better than random.
>>>> Maybe you need a Vector implementation that is backed by memory-mapped
>>>> files?
>>>>
>>>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <p....@rambler-co.ru>
>>>> wrote:
>>>>> Thank you Jeff, Paritosh,
>>>>>
>>>>> Reducing cluster count from 1000 to 100 made my day. I will also try
>>>>> Canopy for initial cluster count.
>>>>> Unfortunately I don't know how to reduce my 200k dictionary. There are
>>>>> no
>>>>> unfrequent terms.
>>>>>
>>>>> I will provide you with Hadoop config shortly. But I am pretty sure I
>>>>> can't decrease number of Mappers/Reducers per node or increase memory
>>>>> limits. It will affect the whole cluster.
>>>>>
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Pavel
>>>>>
>>>>>
>>>>> 08.08.12 16:15 пользователь "Jeff Eastman" <jd...@windwardsolutions.com>
>>>>> написал:
>>>>>
>>>>>> Consider that each cluster retains 4 vectors in memory in each mapper
>>>>>> and reducer, and that these vectors tend to become more dense (through
>>>>>> addition of multiple sparse components) as iterations proceed. With 1000
>>>>>> clusters and 200k terms in your dictionary this will cause the heap
>>>>>> space to be consumed rapidly as you have noted. Some times you can work
>>>>>> around this problem by increasing your heap size on a per-job basis or
>>>>>> reducing the number of mappers and reducers allowed on each node. Also
>>>>>> be sure you are not launching reducers until all of your mapper tasks
>>>>>> have completed.
>>>>>>
>>>>>> In order to provide more help to you, it would be useful to understand
>>>>>> more about how your cluster is "well tuned". How many mappers & reducers
>>>>>> are you launching in parallel, the heapspace limits set for tasks on
>>>>>> each node, etc.
>>>>>>
>>>>>> For a quick test, try reducing the k to 500 or 100 to see how many
>>>>>> clusters you can reasonably compute with your dataset on your cluster.
>>>>>> Canopy is also a good way to get a feel for a good initial k, though it
>>>>>> can be hard to arrive at good T-values in some text clustering cases.
>>>>>> You, can also try hierarchical clustering with reduced k to stay under
>>>>>> your memory limits.
>>>>>>
>>>>>>
>>>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>>>>> A stacktrace of error would have helped in finding the exact error.
>>>>>>>
>>>>>>> However, number of clusters can create Heap Space problems ( If the
>>>>>>> vector dimension is also high ).
>>>>>>> Either try to reduce the number of initial clusters ( In my opinion,
>>>>>>> the best way to know about initial clusters is Canopy Clustering
>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering )
>>>>>>>
>>>>>>> or, try to reduce the dimension of the vectors.
>>>>>>>
>>>>>>> PS : you are also providing numClusters twice
>>>>>>>
>>>>>>> --numClusters 1000 \ --numClusters 5 \
>>>>>>>
>>>>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I am trying to run KMeans example on 15 000 000 documents (seq2sparse
>>>>>>>> output).
>>>>>>>> There are 1 000 clusters, 200 000 terms dictionary and 3-10 terms
>>>>>>>> document size (titles). seq2sparse produces 200 files 80 MB each.
>>>>>>>>
>>>>>>>> My job failed with Java heap space Error. 1st iteration passes while
>>>>>>>> 2nd iteration fails. On a Map phase of buildClusters I see a lot of
>>>>>>>> warnings, but it passes. Reduce phase of buildClusters fails with
>>>>>>>> "Java Heap space".
>>>>>>>>
>>>>>>>> I can not increase reducer/mapper memory in hadoop. My cluster is
>>>>>>>> tunned well.
>>>>>>>>
>>>>>>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>>>>>>> Reducers running 40 servers 8-core 12 GB RAM.
>>>>>>>>
>>>>>>>> Thanks in advance!
>>>>>>>>
>>>>>>>> Here is KMeans parameters:
>>>>>>>>
>>>>>>>> ------------------------------------------------
>>>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>>>> -i ...tfidf-vectors/  \
>>>>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>>>>> --clusters /tmp/clusters/ \
>>>>>>>> --numClusters 1000 \
>>>>>>>> --numClusters 5 \
>>>>>>>> --overwrite \
>>>>>>>> --clustering
>>>>>>>> ------------------------------------------------
>>>>>>>>
>>>>>>>> Pavel
>>>>>>>
>>>>>>>
>>>>>>>
>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> goksron@gmail.com



Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Ted Dunning <te...@gmail.com>.
Also, the knn package has a single pass k-means implementation that
can easily handle 20,000 clusters or more.  This is done by using an
approximate nearest neighbor algorithm inside the k-means
implementation to decrease the time dependency on k to roughly log k.

See http://github.com/tdunning/knn
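
To illustrate why an approximate nearest-neighbour search cuts the per-point
cost of finding the closest of k centroids from O(k) toward roughly O(log k),
here is a toy sketch of one such trick (a single 1-D projection search) in
plain Java. It is only an illustration under simplifying assumptions; the
class below is invented for the example, and the knn package uses more robust
structures than a single random projection.

------------------------------------------------
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class ProjectionSearchSketch {

  private final double[][] centroids;   // centroids sorted by projection value
  private final double[] projections;   // projection of each sorted centroid
  private final double[] direction;     // random projection direction
  private final int searchWindow;       // how many candidates to scan exactly

  ProjectionSearchSketch(double[][] centers, int searchWindow, Random rng) {
    this.searchWindow = searchWindow;
    int dim = centers[0].length;
    direction = new double[dim];
    for (int i = 0; i < dim; i++) {
      direction[i] = rng.nextGaussian();
    }
    double[] proj = new double[centers.length];
    Integer[] order = new Integer[centers.length];
    for (int i = 0; i < centers.length; i++) {
      proj[i] = dot(centers[i], direction);
      order[i] = i;
    }
    Arrays.sort(order, Comparator.comparingDouble((Integer i) -> proj[i]));
    centroids = new double[centers.length][];
    projections = new double[centers.length];
    for (int i = 0; i < centers.length; i++) {
      centroids[i] = centers[order[i]];
      projections[i] = proj[order[i]];
    }
  }

  // Approximate nearest centroid: binary-search the query's projection
  // (the O(log k) part), then compare exactly against a small window.
  int nearest(double[] query) {
    double q = dot(query, direction);
    int pos = Arrays.binarySearch(projections, q);
    if (pos < 0) {
      pos = -pos - 1;
    }
    int lo = Math.max(0, pos - searchWindow);
    int hi = Math.min(centroids.length, pos + searchWindow);
    int best = lo;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int i = lo; i < hi; i++) {
      double d = squaredDistance(query, centroids[i]);
      if (d < bestDist) {
        bestDist = d;
        best = i;
      }
    }
    return best;   // index into the projection-sorted centroid list
  }

  private static double dot(double[] a, double[] b) {
    double s = 0.0;
    for (int i = 0; i < a.length; i++) {
      s += a[i] * b[i];
    }
    return s;
  }

  private static double squaredDistance(double[] a, double[] b) {
    double s = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }
}
------------------------------------------------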

Any help in testing these new capabilities or plumbing them into the
standard Mahout capabilities would be very much appreciated.

On Thu, Aug 9, 2012 at 7:05 AM, Ted Dunning <te...@gmail.com> wrote:
> The upcoming knn package has a file based matrix implementation that uses memory mapping to allow sharing a copy of a large matrix between processes and threads.
>
> Sent from my iPhone
>
> On Aug 9, 2012, at 1:48 AM, Abramov Pavel <p....@rambler-co.ru> wrote:
>
>> Hello,
>>
>> If think Zipf's law is relevant for my data. Thanks for idea about
>> memory-mapping.
>>
>> 1) How can I "drop" extremely small/large clusters? There are 50% small
>> clusters with only 1 member while 1 large cluster has 50% members. Is
>> there a way to "split" large clusters with Kmeans?
>>
>> 2) Can I force Mahout not to use exact centroid but the closest point from
>> my set? Any point has ~10 non-zero components while exact centroid is very
>> dense (~200k).
>>
>>
>> Thanks!
>>
>> Pavel
>>
>>
>> 09.08.12 5:43 пользователь "Lance Norskog" <go...@gmail.com> написал:
>>
>>> If Zipf's Law is relevant, locality will be much better than random.
>>> Maybe you need a Vector implementation that is backed by memory-mapped
>>> files?
>>>
>>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <p....@rambler-co.ru>
>>> wrote:
>>>> Thank you Jeff, Paritosh,
>>>>
>>>> Reducing cluster count from 1000 to 100 made my day. I will also try
>>>> Canopy for initial cluster count.
>>>> Unfortunately I don't know how to reduce my 200k dictionary. There are
>>>> no
>>>> unfrequent terms.
>>>>
>>>> I will provide you with Hadoop config shortly. But I am pretty sure I
>>>> can't decrease number of Mappers/Reducers per node or increase memory
>>>> limits. It will affect the whole cluster.
>>>>
>>>>
>>>> Thank you!
>>>>
>>>> Pavel
>>>>
>>>>
>>>> 08.08.12 16:15 пользователь "Jeff Eastman" <jd...@windwardsolutions.com>
>>>> написал:
>>>>
>>>>> Consider that each cluster retains 4 vectors in memory in each mapper
>>>>> and reducer, and that these vectors tend to become more dense (through
>>>>> addition of multiple sparse components) as iterations proceed. With 1000
>>>>> clusters and 200k terms in your dictionary this will cause the heap
>>>>> space to be consumed rapidly as you have noted. Some times you can work
>>>>> around this problem by increasing your heap size on a per-job basis or
>>>>> reducing the number of mappers and reducers allowed on each node. Also
>>>>> be sure you are not launching reducers until all of your mapper tasks
>>>>> have completed.
>>>>>
>>>>> In order to provide more help to you, it would be useful to understand
>>>>> more about how your cluster is "well tuned". How many mappers & reducers
>>>>> are you launching in parallel, the heapspace limits set for tasks on
>>>>> each node, etc.
>>>>>
>>>>> For a quick test, try reducing the k to 500 or 100 to see how many
>>>>> clusters you can reasonably compute with your dataset on your cluster.
>>>>> Canopy is also a good way to get a feel for a good initial k, though it
>>>>> can be hard to arrive at good T-values in some text clustering cases.
>>>>> You, can also try hierarchical clustering with reduced k to stay under
>>>>> your memory limits.
>>>>>
>>>>>
>>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>>>> A stacktrace of error would have helped in finding the exact error.
>>>>>>
>>>>>> However, number of clusters can create Heap Space problems ( If the
>>>>>> vector dimension is also high ).
>>>>>> Either try to reduce the number of initial clusters ( In my opinion,
>>>>>> the best way to know about initial clusters is Canopy Clustering
>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering )
>>>>>>
>>>>>> or, try to reduce the dimension of the vectors.
>>>>>>
>>>>>> PS : you are also providing numClusters twice
>>>>>>
>>>>>> --numClusters 1000 \ --numClusters 5 \
>>>>>>
>>>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am trying to run KMeans example on 15 000 000 documents (seq2sparse
>>>>>>> output).
>>>>>>> There are 1 000 clusters, 200 000 terms dictionary and 3-10 terms
>>>>>>> document size (titles). seq2sparse produces 200 files 80 MB each.
>>>>>>>
>>>>>>> My job failed with Java heap space Error. 1st iteration passes while
>>>>>>> 2nd iteration fails. On a Map phase of buildClusters I see a lot of
>>>>>>> warnings, but it passes. Reduce phase of buildClusters fails with
>>>>>>> "Java Heap space".
>>>>>>>
>>>>>>> I can not increase reducer/mapper memory in hadoop. My cluster is
>>>>>>> tunned well.
>>>>>>>
>>>>>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>>>>>> Reducers running 40 servers 8-core 12 GB RAM.
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>> Here is KMeans parameters:
>>>>>>>
>>>>>>> ------------------------------------------------
>>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>>> -i ...tfidf-vectors/  \
>>>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>>>> --clusters /tmp/clusters/ \
>>>>>>> --numClusters 1000 \
>>>>>>> --numClusters 5 \
>>>>>>> --overwrite \
>>>>>>> --clustering
>>>>>>> ------------------------------------------------
>>>>>>>
>>>>>>> Pavel
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>>

Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Ted Dunning <te...@gmail.com>.
The upcoming knn package has a file based matrix implementation that uses memory mapping to allow sharing a copy of a large matrix between processes and threads.  

Sent from my iPhone

On Aug 9, 2012, at 1:48 AM, Abramov Pavel <p....@rambler-co.ru> wrote:

> Hello, 
> 
> If think Zipf's law is relevant for my data. Thanks for idea about
> memory-mapping.
> 
> 1) How can I "drop" extremely small/large clusters? There are 50% small
> clusters with only 1 member while 1 large cluster has 50% members. Is
> there a way to "split" large clusters with Kmeans?
> 
> 2) Can I force Mahout not to use exact centroid but the closest point from
> my set? Any point has ~10 non-zero components while exact centroid is very
> dense (~200k).
> 
> 
> Thanks!
> 
> Pavel
> 
> 
> 09.08.12 5:43 пользователь "Lance Norskog" <go...@gmail.com> написал:
> 
>> If Zipf's Law is relevant, locality will be much better than random.
>> Maybe you need a Vector implementation that is backed by memory-mapped
>> files?
>> 
>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <p....@rambler-co.ru>
>> wrote:
>>> Thank you Jeff, Paritosh,
>>> 
>>> Reducing cluster count from 1000 to 100 made my day. I will also try
>>> Canopy for initial cluster count.
>>> Unfortunately I don't know how to reduce my 200k dictionary. There are
>>> no
>>> unfrequent terms.
>>> 
>>> I will provide you with Hadoop config shortly. But I am pretty sure I
>>> can't decrease number of Mappers/Reducers per node or increase memory
>>> limits. It will affect the whole cluster.
>>> 
>>> 
>>> Thank you!
>>> 
>>> Pavel
>>> 
>>> 
>>> 08.08.12 16:15 пользователь "Jeff Eastman" <jd...@windwardsolutions.com>
>>> написал:
>>> 
>>>> Consider that each cluster retains 4 vectors in memory in each mapper
>>>> and reducer, and that these vectors tend to become more dense (through
>>>> addition of multiple sparse components) as iterations proceed. With 1000
>>>> clusters and 200k terms in your dictionary this will cause the heap
>>>> space to be consumed rapidly as you have noted. Some times you can work
>>>> around this problem by increasing your heap size on a per-job basis or
>>>> reducing the number of mappers and reducers allowed on each node. Also
>>>> be sure you are not launching reducers until all of your mapper tasks
>>>> have completed.
>>>> 
>>>> In order to provide more help to you, it would be useful to understand
>>>> more about how your cluster is "well tuned". How many mappers & reducers
>>>> are you launching in parallel, the heapspace limits set for tasks on
>>>> each node, etc.
>>>> 
>>>> For a quick test, try reducing the k to 500 or 100 to see how many
>>>> clusters you can reasonably compute with your dataset on your cluster.
>>>> Canopy is also a good way to get a feel for a good initial k, though it
>>>> can be hard to arrive at good T-values in some text clustering cases.
>>>> You, can also try hierarchical clustering with reduced k to stay under
>>>> your memory limits.
>>>> 
>>>> 
>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>>> A stacktrace of error would have helped in finding the exact error.
>>>>> 
>>>>> However, number of clusters can create Heap Space problems ( If the
>>>>> vector dimension is also high ).
>>>>> Either try to reduce the number of initial clusters ( In my opinion,
>>>>> the best way to know about initial clusters is Canopy Clustering
>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering )
>>>>> 
>>>>> or, try to reduce the dimension of the vectors.
>>>>> 
>>>>> PS : you are also providing numClusters twice
>>>>> 
>>>>> --numClusters 1000 \ --numClusters 5 \
>>>>> 
>>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> I am trying to run KMeans example on 15 000 000 documents (seq2sparse
>>>>>> output).
>>>>>> There are 1 000 clusters, 200 000 terms dictionary and 3-10 terms
>>>>>> document size (titles). seq2sparse produces 200 files 80 MB each.
>>>>>> 
>>>>>> My job failed with Java heap space Error. 1st iteration passes while
>>>>>> 2nd iteration fails. On a Map phase of buildClusters I see a lot of
>>>>>> warnings, but it passes. Reduce phase of buildClusters fails with
>>>>>> "Java Heap space".
>>>>>> 
>>>>>> I can not increase reducer/mapper memory in hadoop. My cluster is
>>>>>> tunned well.
>>>>>> 
>>>>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>>>>> Reducers running 40 servers 8-core 12 GB RAM.
>>>>>> 
>>>>>> Thanks in advance!
>>>>>> 
>>>>>> Here is KMeans parameters:
>>>>>> 
>>>>>> ------------------------------------------------
>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>> -i ...tfidf-vectors/  \
>>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>>> --clusters /tmp/clusters/ \
>>>>>> --numClusters 1000 \
>>>>>> --numClusters 5 \
>>>>>> --overwrite \
>>>>>> --clustering
>>>>>> ------------------------------------------------
>>>>>> 
>>>>>> Pavel
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> Lance Norskog
>> goksron@gmail.com
> 

Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Paritosh Ranjan <pr...@xebia.com>.
1) An outlierThreshold can be provided; then only those vectors whose 
pdf is greater than this threshold will be included in the cluster.
Try playing around with this value; it might help. (An example 
invocation is sketched after point 2 below.)

2) This question is not that clear. However, I think you are talking 
about the clustering phase. If that is the case, then it won't be 
possible to use the closest point instead of the centroid without 
changes to the code base.
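
For (1), as a sketch: assuming your Mahout build exposes this on the kmeans
command line as --outlierThreshold (check "mahout kmeans --help"; the 0.8
below is only a placeholder value), it would just be appended to the existing
invocation, e.g.

------------------------------------------------
mahout kmeans -Dmapred.reduce.tasks=200 \
-i ...tfidf-vectors/ \
-o /tmp/clustering_results_kmeans/ \
--clusters /tmp/clusters/ \
--numClusters 100 \
--outlierThreshold 0.8 \
--overwrite \
--clustering
------------------------------------------------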

On 09-08-2012 14:18, Abramov Pavel wrote:
> Hello,
>
> If think Zipf's law is relevant for my data. Thanks for idea about
> memory-mapping.
>
> 1) How can I "drop" extremely small/large clusters? There are 50% small
> clusters with only 1 member while 1 large cluster has 50% members. Is
> there a way to "split" large clusters with Kmeans?
>
> 2) Can I force Mahout not to use exact centroid but the closest point from
> my set? Any point has ~10 non-zero components while exact centroid is very
> dense (~200k).
>
>
> Thanks!
>
> Pavel
>
>
> 09.08.12 5:43 пользователь "Lance Norskog" <go...@gmail.com> написал:
>
>> If Zipf's Law is relevant, locality will be much better than random.
>> Maybe you need a Vector implementation that is backed by memory-mapped
>> files?
>>
>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <p....@rambler-co.ru>
>> wrote:
>>> Thank you Jeff, Paritosh,
>>>
>>> Reducing cluster count from 1000 to 100 made my day. I will also try
>>> Canopy for initial cluster count.
>>> Unfortunately I don't know how to reduce my 200k dictionary. There are
>>> no
>>> unfrequent terms.
>>>
>>> I will provide you with Hadoop config shortly. But I am pretty sure I
>>> can't decrease number of Mappers/Reducers per node or increase memory
>>> limits. It will affect the whole cluster.
>>>
>>>
>>> Thank you!
>>>
>>> Pavel
>>>
>>>
>>> 08.08.12 16:15 пользователь "Jeff Eastman" <jd...@windwardsolutions.com>
>>> написал:
>>>
>>>> Consider that each cluster retains 4 vectors in memory in each mapper
>>>> and reducer, and that these vectors tend to become more dense (through
>>>> addition of multiple sparse components) as iterations proceed. With 1000
>>>> clusters and 200k terms in your dictionary this will cause the heap
>>>> space to be consumed rapidly as you have noted. Some times you can work
>>>> around this problem by increasing your heap size on a per-job basis or
>>>> reducing the number of mappers and reducers allowed on each node. Also
>>>> be sure you are not launching reducers until all of your mapper tasks
>>>> have completed.
>>>>
>>>> In order to provide more help to you, it would be useful to understand
>>>> more about how your cluster is "well tuned". How many mappers & reducers
>>>> are you launching in parallel, the heapspace limits set for tasks on
>>>> each node, etc.
>>>>
>>>> For a quick test, try reducing the k to 500 or 100 to see how many
>>>> clusters you can reasonably compute with your dataset on your cluster.
>>>> Canopy is also a good way to get a feel for a good initial k, though it
>>>> can be hard to arrive at good T-values in some text clustering cases.
>>>> You, can also try hierarchical clustering with reduced k to stay under
>>>> your memory limits.
>>>>
>>>>
>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>>> A stacktrace of error would have helped in finding the exact error.
>>>>>
>>>>> However, number of clusters can create Heap Space problems ( If the
>>>>> vector dimension is also high ).
>>>>> Either try to reduce the number of initial clusters ( In my opinion,
>>>>> the best way to know about initial clusters is Canopy Clustering
>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering )
>>>>>
>>>>> or, try to reduce the dimension of the vectors.
>>>>>
>>>>> PS : you are also providing numClusters twice
>>>>>
>>>>> --numClusters 1000 \ --numClusters 5 \
>>>>>
>>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I am trying to run KMeans example on 15 000 000 documents (seq2sparse
>>>>>> output).
>>>>>> There are 1 000 clusters, 200 000 terms dictionary and 3-10 terms
>>>>>> document size (titles). seq2sparse produces 200 files 80 MB each.
>>>>>>
>>>>>> My job failed with Java heap space Error. 1st iteration passes while
>>>>>> 2nd iteration fails. On a Map phase of buildClusters I see a lot of
>>>>>> warnings, but it passes. Reduce phase of buildClusters fails with
>>>>>> "Java Heap space".
>>>>>>
>>>>>> I can not increase reducer/mapper memory in hadoop. My cluster is
>>>>>> tunned well.
>>>>>>
>>>>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>>>>> Reducers running 40 servers 8-core 12 GB RAM.
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> Here is KMeans parameters:
>>>>>>
>>>>>> ------------------------------------------------
>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>> -i ...tfidf-vectors/  \
>>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>>> --clusters /tmp/clusters/ \
>>>>>> --numClusters 1000 \
>>>>>> --numClusters 5 \
>>>>>> --overwrite \
>>>>>> --clustering
>>>>>> ------------------------------------------------
>>>>>>
>>>>>> Pavel
>>>>>
>>>>>
>>>>>
>>
>>
>> -- 
>> Lance Norskog
>> goksron@gmail.com



Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Abramov Pavel <p....@rambler-co.ru>.
Hello, 

I think Zipf's law is relevant for my data. Thanks for the idea about
memory-mapping.

1) How can I "drop" extremely small/large clusters? 50% of the clusters are
small, with only 1 member, while 1 large cluster holds 50% of the members. Is
there a way to "split" large clusters with KMeans?

2) Can I force Mahout to use not the exact centroid but the closest point from
my set? Any point has ~10 non-zero components, while the exact centroid is very
dense (~200k).


Thanks!

Pavel


On 09.08.12 at 5:43, "Lance Norskog" <go...@gmail.com> wrote:

>If Zipf's Law is relevant, locality will be much better than random.
>Maybe you need a Vector implementation that is backed by memory-mapped
>files?
>
>On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <p....@rambler-co.ru>
>wrote:
>> Thank you Jeff, Paritosh,
>>
>> Reducing cluster count from 1000 to 100 made my day. I will also try
>> Canopy for initial cluster count.
>> Unfortunately I don't know how to reduce my 200k dictionary. There are
>>no
>> unfrequent terms.
>>
>> I will provide you with Hadoop config shortly. But I am pretty sure I
>> can't decrease number of Mappers/Reducers per node or increase memory
>> limits. It will affect the whole cluster.
>>
>>
>> Thank you!
>>
>> Pavel
>>
>>
>> 08.08.12 16:15 пользователь "Jeff Eastman" <jd...@windwardsolutions.com>
>> написал:
>>
>>>Consider that each cluster retains 4 vectors in memory in each mapper
>>>and reducer, and that these vectors tend to become more dense (through
>>>addition of multiple sparse components) as iterations proceed. With 1000
>>>clusters and 200k terms in your dictionary this will cause the heap
>>>space to be consumed rapidly as you have noted. Some times you can work
>>>around this problem by increasing your heap size on a per-job basis or
>>>reducing the number of mappers and reducers allowed on each node. Also
>>>be sure you are not launching reducers until all of your mapper tasks
>>>have completed.
>>>
>>>In order to provide more help to you, it would be useful to understand
>>>more about how your cluster is "well tuned". How many mappers & reducers
>>>are you launching in parallel, the heapspace limits set for tasks on
>>>each node, etc.
>>>
>>>For a quick test, try reducing the k to 500 or 100 to see how many
>>>clusters you can reasonably compute with your dataset on your cluster.
>>>Canopy is also a good way to get a feel for a good initial k, though it
>>>can be hard to arrive at good T-values in some text clustering cases.
>>>You, can also try hierarchical clustering with reduced k to stay under
>>>your memory limits.
>>>
>>>
>>>On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>> A stacktrace of error would have helped in finding the exact error.
>>>>
>>>> However, number of clusters can create Heap Space problems ( If the
>>>> vector dimension is also high ).
>>>> Either try to reduce the number of initial clusters ( In my opinion,
>>>> the best way to know about initial clusters is Canopy Clustering
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering )
>>>>
>>>> or, try to reduce the dimension of the vectors.
>>>>
>>>> PS : you are also providing numClusters twice
>>>>
>>>> --numClusters 1000 \ --numClusters 5 \
>>>>
>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>> Hello,
>>>>>
>>>>> I am trying to run KMeans example on 15 000 000 documents (seq2sparse
>>>>> output).
>>>>> There are 1 000 clusters, 200 000 terms dictionary and 3-10 terms
>>>>> document size (titles). seq2sparse produces 200 files 80 MB each.
>>>>>
>>>>> My job failed with Java heap space Error. 1st iteration passes while
>>>>> 2nd iteration fails. On a Map phase of buildClusters I see a lot of
>>>>> warnings, but it passes. Reduce phase of buildClusters fails with
>>>>> "Java Heap space".
>>>>>
>>>>> I can not increase reducer/mapper memory in hadoop. My cluster is
>>>>> tunned well.
>>>>>
>>>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>>>> Reducers running 40 servers 8-core 12 GB RAM.
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> Here is KMeans parameters:
>>>>>
>>>>> ------------------------------------------------
>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>> -i ...tfidf-vectors/  \
>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>> --clusters /tmp/clusters/ \
>>>>> --numClusters 1000 \
>>>>> --numClusters 5 \
>>>>> --overwrite \
>>>>> --clustering
>>>>> ------------------------------------------------
>>>>>
>>>>> Pavel
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>
>
>-- 
>Lance Norskog
>goksron@gmail.com


Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Lance Norskog <go...@gmail.com>.
If Zipf's Law is relevant, locality will be much better than random.
Maybe you need a Vector implementation that is backed by memory-mapped
files?
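
As a bare-bones illustration of that idea (not a Mahout Vector
implementation; the class below and its file layout of raw big-endian
doubles are assumptions made up for the sketch), a read-only dense vector
backed by a memory-mapped file could look roughly like this:

------------------------------------------------
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;

public class MappedDenseVector {

  private final DoubleBuffer values;

  // Maps the whole file as read-only; the mapping remains valid after the
  // channel is closed, so the OS page cache backs all further reads.
  public MappedDenseVector(String path) throws IOException {
    try (RandomAccessFile file = new RandomAccessFile(path, "r");
         FileChannel channel = file.getChannel()) {
      values = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
                      .asDoubleBuffer();
    }
  }

  public int size() {
    return values.limit();
  }

  public double get(int index) {
    return values.get(index);
  }

  // Example operation: dot product against another mapped vector.
  public double dot(MappedDenseVector other) {
    double sum = 0.0;
    for (int i = 0; i < size(); i++) {
      sum += get(i) * other.get(i);
    }
    return sum;
  }
}
------------------------------------------------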

On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <p....@rambler-co.ru> wrote:
> Thank you Jeff, Paritosh,
>
> Reducing cluster count from 1000 to 100 made my day. I will also try
> Canopy for initial cluster count.
> Unfortunately I don't know how to reduce my 200k dictionary. There are no
> unfrequent terms.
>
> I will provide you with Hadoop config shortly. But I am pretty sure I
> can't decrease number of Mappers/Reducers per node or increase memory
> limits. It will affect the whole cluster.
>
>
> Thank you!
>
> Pavel
>
>
> 08.08.12 16:15 пользователь "Jeff Eastman" <jd...@windwardsolutions.com>
> написал:
>
>>Consider that each cluster retains 4 vectors in memory in each mapper
>>and reducer, and that these vectors tend to become more dense (through
>>addition of multiple sparse components) as iterations proceed. With 1000
>>clusters and 200k terms in your dictionary this will cause the heap
>>space to be consumed rapidly as you have noted. Some times you can work
>>around this problem by increasing your heap size on a per-job basis or
>>reducing the number of mappers and reducers allowed on each node. Also
>>be sure you are not launching reducers until all of your mapper tasks
>>have completed.
>>
>>In order to provide more help to you, it would be useful to understand
>>more about how your cluster is "well tuned". How many mappers & reducers
>>are you launching in parallel, the heapspace limits set for tasks on
>>each node, etc.
>>
>>For a quick test, try reducing the k to 500 or 100 to see how many
>>clusters you can reasonably compute with your dataset on your cluster.
>>Canopy is also a good way to get a feel for a good initial k, though it
>>can be hard to arrive at good T-values in some text clustering cases.
>>You, can also try hierarchical clustering with reduced k to stay under
>>your memory limits.
>>
>>
>>On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>> A stacktrace of error would have helped in finding the exact error.
>>>
>>> However, number of clusters can create Heap Space problems ( If the
>>> vector dimension is also high ).
>>> Either try to reduce the number of initial clusters ( In my opinion,
>>> the best way to know about initial clusters is Canopy Clustering
>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering )
>>>
>>> or, try to reduce the dimension of the vectors.
>>>
>>> PS : you are also providing numClusters twice
>>>
>>> --numClusters 1000 \ --numClusters 5 \
>>>
>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>> Hello,
>>>>
>>>> I am trying to run KMeans example on 15 000 000 documents (seq2sparse
>>>> output).
>>>> There are 1 000 clusters, 200 000 terms dictionary and 3-10 terms
>>>> document size (titles). seq2sparse produces 200 files 80 MB each.
>>>>
>>>> My job failed with Java heap space Error. 1st iteration passes while
>>>> 2nd iteration fails. On a Map phase of buildClusters I see a lot of
>>>> warnings, but it passes. Reduce phase of buildClusters fails with
>>>> "Java Heap space".
>>>>
>>>> I can not increase reducer/mapper memory in hadoop. My cluster is
>>>> tunned well.
>>>>
>>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>>> Reducers running 40 servers 8-core 12 GB RAM.
>>>>
>>>> Thanks in advance!
>>>>
>>>> Here is KMeans parameters:
>>>>
>>>> ------------------------------------------------
>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>> -i ...tfidf-vectors/  \
>>>> -o /tmp/clustering_results_kmeans/ \
>>>> --clusters /tmp/clusters/ \
>>>> --numClusters 1000 \
>>>> --numClusters 5 \
>>>> --overwrite \
>>>> --clustering
>>>> ------------------------------------------------
>>>>
>>>> Pavel
>>>
>>>
>>>
>>>
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Abramov Pavel <p....@rambler-co.ru>.
Thank you Jeff, Paritosh,

Reducing the cluster count from 1000 to 100 made my day. I will also try
Canopy for the initial cluster count.
Unfortunately, I don't know how to reduce my 200k dictionary. There are
no infrequent terms.
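
For reference, a Canopy run to estimate the initial centroids might look
roughly like the following (the distance measure and the T1/T2 values are
placeholders that would still need tuning for this data):

------------------------------------------------
mahout canopy \
-i ...tfidf-vectors/ \
-o /tmp/canopy_centroids/ \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-t1 0.5 \
-t2 0.3 \
-ow
------------------------------------------------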

I will provide you with the Hadoop config shortly. But I am pretty sure I
can't decrease the number of Mappers/Reducers per node or increase memory
limits; it would affect the whole cluster.


Thank you!

Pavel


On 08.08.12 at 16:15, "Jeff Eastman" <jd...@windwardsolutions.com> wrote:

>Consider that each cluster retains 4 vectors in memory in each mapper
>and reducer, and that these vectors tend to become more dense (through
>addition of multiple sparse components) as iterations proceed. With 1000
>clusters and 200k terms in your dictionary this will cause the heap
>space to be consumed rapidly as you have noted. Some times you can work
>around this problem by increasing your heap size on a per-job basis or
>reducing the number of mappers and reducers allowed on each node. Also
>be sure you are not launching reducers until all of your mapper tasks
>have completed.
>
>In order to provide more help to you, it would be useful to understand
>more about how your cluster is "well tuned". How many mappers & reducers
>are you launching in parallel, the heapspace limits set for tasks on
>each node, etc.
>
>For a quick test, try reducing the k to 500 or 100 to see how many
>clusters you can reasonably compute with your dataset on your cluster.
>Canopy is also a good way to get a feel for a good initial k, though it
>can be hard to arrive at good T-values in some text clustering cases.
>You can also try hierarchical clustering with a reduced k to stay under
>your memory limits.
>
>
>On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>> A stacktrace of error would have helped in finding the exact error.
>>
>> However, number of clusters can create Heap Space problems ( If the
>> vector dimension is also high ).
>> Either try to reduce the number of initial clusters ( In my opinion,
>> the best way to know about initial clusters is Canopy Clustering
>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering )
>>
>> or, try to reduce the dimension of the vectors.
>>
>> PS : you are also providing numClusters twice
>>
>> --numClusters 1000 \ --numClusters 5 \
>>
>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>> Hello,
>>>
>>> I am trying to run KMeans example on 15 000 000 documents (seq2sparse
>>> output).
>>> There are 1 000 clusters, 200 000 terms dictionary and 3-10 terms
>>> document size (titles). seq2sparse produces 200 files 80 MB each.
>>>
>>> My job failed with Java heap space Error. 1st iteration passes while
>>> 2nd iteration fails. On a Map phase of buildClusters I see a lot of
>>> warnings, but it passes. Reduce phase of buildClusters fails with
>>> "Java Heap space".
>>>
>>> I can not increase reducer/mapper memory in hadoop. My cluster is
>>> tunned well.
>>>
>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>> Reducers running 40 servers 8-core 12 GB RAM.
>>>
>>> Thanks in advance!
>>>
>>> Here is KMeans parameters:
>>>
>>> ------------------------------------------------
>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>> -i ...tfidf-vectors/  \
>>> -o /tmp/clustering_results_kmeans/ \
>>> --clusters /tmp/clusters/ \
>>> --numClusters 1000 \
>>> --numClusters 5 \
>>> --overwrite \
>>> --clustering
>>> ------------------------------------------------
>>>
>>> Pavel
>>
>>
>>
>>
>


Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Consider that each cluster retains 4 vectors in memory in each mapper 
and reducer, and that these vectors tend to become more dense (through 
addition of multiple sparse components) as iterations proceed. With 1000 
clusters and 200k terms in your dictionary this will cause the heap 
space to be consumed rapidly, as you have noted. Sometimes you can work 
around this problem by increasing your heap size on a per-job basis or 
reducing the number of mappers and reducers allowed on each node. Also 
be sure you are not launching reducers until all of your mapper tasks 
have completed.
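
(Back-of-envelope, and only a rough upper bound since it assumes the
centroid vectors densify toward the full dictionary and are held as
8-byte doubles:

  1000 clusters x 4 vectors x 200,000 terms x 8 bytes  ~=  6.4 GB

per task, which is consistent with the first iteration passing and the
second one, after the centroids have absorbed many sparse documents,
running out of heap.)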

In order to provide more help, it would be useful to understand more 
about how your cluster is "well tuned": how many mappers & reducers you 
are launching in parallel, the heap space limits set for tasks on each 
node, etc.

For a quick test, try reducing the k to 500 or 100 to see how many 
clusters you can reasonably compute with your dataset on your cluster. 
Canopy is also a good way to get a feel for a good initial k, though it 
can be hard to arrive at good T-values in some text clustering cases. 
You can also try hierarchical clustering with a reduced k to stay under 
your memory limits.
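
If you go the Canopy route, the sequence would look roughly like this.
The CosineDistanceMeasure and the T1/T2 values are placeholders you will
have to tune for your data, and the exact name of the canopy output
directory (clusters-0-final below) can differ between Mahout versions:

------------------------------------------------
mahout canopy \
-i ...tfidf-vectors/  \
-o /tmp/canopy_centroids/ \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-t1 0.5 \
-t2 0.3 \
--overwrite

mahout kmeans \
-i ...tfidf-vectors/  \
-o /tmp/clustering_results_kmeans/ \
--clusters /tmp/canopy_centroids/clusters-0-final \
--overwrite \
--clustering
------------------------------------------------

With --clusters pointing at the canopy output and no --numClusters, the
k-means driver should pick up the canopy centroids as its initial
clusters instead of sampling k random ones.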


On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
> A stacktrace of error would have helped in finding the exact error.
>
> However, number of clusters can create Heap Space problems ( If the 
> vector dimension is also high ).
> Either try to reduce the number of initial clusters ( In my opinion, 
> the best way to know about initial clusters is Canopy Clustering 
> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering )
>
> or, try to reduce the dimension of the vectors.
>
> PS : you are also providing numClusters twice
>
> --numClusters 1000 \ --numClusters 5 \
>
> On 08-08-2012 10:42, Abramov Pavel wrote:
>> Hello,
>>
>> I am trying to run KMeans example on 15 000 000 documents (seq2sparse 
>> output).
>> There are 1 000 clusters, 200 000 terms dictionary and 3-10 terms 
>> document size (titles). seq2sparse produces 200 files 80 MB each.
>>
>> My job failed with Java heap space Error. 1st iteration passes while 
>> 2nd iteration fails. On a Map phase of buildClusters I see a lot of 
>> warnings, but it passes. Reduce phase of buildClusters fails with 
>> "Java Heap space".
>>
>> I can not increase reducer/mapper memory in hadoop. My cluster is 
>> tunned well.
>>
>> How can I avoid this situation? My cluster has 300 Mappers and 220 
>> Reducers running 40 servers 8-core 12 GB RAM.
>>
>> Thanks in advance!
>>
>> Here is KMeans parameters:
>>
>> ------------------------------------------------
>> mahout kmeans -Dmapred.reduce.tasks=200 \
>> -i ...tfidf-vectors/  \
>> -o /tmp/clustering_results_kmeans/ \
>> --clusters /tmp/clusters/ \
>> --numClusters 1000 \
>> --numClusters 5 \
>> --overwrite \
>> --clustering
>> ------------------------------------------------
>>
>> Pavel
>
>
>
>


Re: KMeans job fails during 2nd iteration. Java Heap space

Posted by Paritosh Ranjan <pr...@xebia.com>.
A stack trace of the error would have helped in finding the exact problem.

However, the number of clusters can create heap space problems (if the 
vector dimension is also high).
Either try to reduce the number of initial clusters (in my opinion, the 
best way to determine good initial clusters is Canopy Clustering: 
https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering)

or try to reduce the dimension of the vectors (one possible route is sketched below).
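
One hedged way to do that, assuming the usual seq2sparse pruning options
are available in your version, is to re-run vectorization with more
aggressive document-frequency pruning so the dictionary shrinks before
clustering ever starts. The paths and thresholds below are placeholders
only:

------------------------------------------------
mahout seq2sparse \
-i /path/to/sequence-files/ \
-o /path/to/pruned-vectors/ \
--minSupport 5 \
--maxDFPercent 50 \
--namedVector \
--overwrite
------------------------------------------------

--minSupport drops terms occurring fewer than 5 times overall, and
--maxDFPercent drops terms appearing in more than 50% of documents; both
directly reduce the dictionary (and thus the centroid) size.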

PS: you are also providing numClusters twice:

--numClusters 1000 \ --numClusters 5 \

On 08-08-2012 10:42, Abramov Pavel wrote:
> Hello,
>
> I am trying to run KMeans example on 15 000 000 documents (seq2sparse output).
> There are 1 000 clusters, 200 000 terms dictionary and 3-10 terms document size (titles). seq2sparse produces 200 files 80 MB each.
>
> My job failed with Java heap space Error. 1st iteration passes while 2nd iteration fails. On a Map phase of buildClusters I see a lot of warnings, but it passes. Reduce phase of buildClusters fails with "Java Heap space".
>
> I can not increase reducer/mapper memory in hadoop. My cluster is tunned well.
>
> How can I avoid this situation? My cluster has 300 Mappers and 220 Reducers running 40 servers 8-core 12 GB RAM.
>
> Thanks in advance!
>
> Here is KMeans parameters:
>
> ------------------------------------------------
> mahout kmeans -Dmapred.reduce.tasks=200 \
> -i ...tfidf-vectors/  \
> -o /tmp/clustering_results_kmeans/ \
> --clusters /tmp/clusters/ \
> --numClusters 1000 \
> --numClusters 5 \
> --overwrite \
> --clustering
> ------------------------------------------------
>
> Pavel