Posted to user@mahout.apache.org by Sebastian Briesemeister <se...@unister-gmbh.de> on 2013/03/28 17:26:46 UTC

Fuzzy Clustering accumulates lots of memory

Dear all,

I have a large dataset consisting of ~50,000 documents with a
dimensionality of 90,000. I split the generated input vectors into
smaller files so that a single mapper task runs on each file.
However, even with very small files containing only 50 documents, I run
into heap space problems.

I tried to debug the problem and started the FuzzyKMeansDriver in local
mode in my IDE. Interestingly, it is already the first mapper task that
very quickly accumulates more than 4 GB.
In the class CIMapper, the method map(..) gets called by the class
Mapper for each input vector of the input split file. Either Mapper or
CIMapper is responsible for the memory consumption, but I could not see
where or why it would accumulate memory, since no additional data is
saved during the mapping process.
I thought maybe the SoftCluster objects require that much, but since
each of them contains 4 dense vectors of doubles (8 bytes each) of size
90,000 and I have 500 clusters, they only sum up to 1.34 GB... so where
are the missing GBs?

Does anyone have an explanation for this behaviour or experience with
memory problems in large-scale clustering?

Thanks in advance
Sebastian

Re: Fuzzy Clustering accumulates lots of memory

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Fuzzy KMeans will use a lot of heap memory because every vector is 
observed (with weighting) by every cluster. This will make the cluster 
centers (and other vectors) much denser than with any of the other 
clustering algorithms. Figure you are storing 90k doubles in each 
vector, and each cluster has 4 vectors (center, radius, s1 & s2) that 
will all become very dense.

KMeans should perform better since each vector is only observed by a 
single cluster and the cluster vectors are more likely to remain fairly 
sparse.

Dirichlet will likely end up somewhere in the middle.
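
To make the densification concrete, here is a small sketch using plain
Java arrays rather than Mahout's actual Vector/SoftCluster classes (all
names and the membership weight are illustrative; allocating the
accumulators alone needs a few hundred MB of heap, which is rather the
point): a sparse document touches only a few dimensions, but in fuzzy
k-means it contributes a weighted update to every cluster's state.

	public class DensificationSketch {
	    public static void main(String[] args) {
	        int dims = 90000;
	        int clusters = 500;

	        // One dense accumulator per cluster (think "s1"): 500 * 90,000
	        // doubles is already about 360 MB; center, radius and s2
	        // multiply that footprint by four.
	        double[][] s1 = new double[clusters][dims];

	        // A sparse document vector: only a handful of non-zero terms.
	        int[] termIndices = {12, 4711, 88000};
	        double[] termWeights = {0.3, 1.7, 0.9};

	        // Fuzzy k-means observes the document in EVERY cluster with a
	        // non-zero membership weight, so all 500 accumulators are
	        // updated. (Plain k-means would update only the nearest one.)
	        for (int c = 0; c < clusters; c++) {
	            double membership = 1.0 / clusters;  // placeholder weight
	            for (int i = 0; i < termIndices.length; i++) {
	                s1[c][termIndices[i]] += membership * termWeights[i];
	            }
	        }
	        System.out.println("updated " + s1.length
	                + " dense accumulators of dimension " + dims);
	    }
	}

Over a full pass, many different documents hit many different term
indices, which is why the cluster vectors described above end up so
dense.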



Re: Fuzzy Clustering accumulates lots of memory

Posted by Sebastian Briesemeister <se...@unister-gmbh.de>.
I tried increasing the child heap size, but as I mentioned, even 4 GB
wasn't enough.

I am also not sure whether the block size has any influence on the
memory usage, but I assume it does not, since such a design would be
really bad.

Any other ideas?




Re: Fuzzy Clustering accumulates lots of memory

Posted by Chris Harrington <ch...@heystaks.com>.
I don't know if this will help with your heap issues (or if you've already tried it), but increasing mapred.child.java.opts in mapred-site.xml resolved some heap issues I was having. I was clustering 67,000 small text docs into ~180 clusters and was seeing mapper heap issues until I made this change.

	<property>
		<name>mapred.child.java.opts</name>
		<value>-Xmx1024M</value>
	</property>

Someone please correct me if I'm wrong, but I think the mapper gets kicked off as a child process (i.e. in its own JVM), which is why increasing Hadoop's own heap size doesn't do anything, but increasing mapred.child.java.opts might help.

Once again, correct me if I'm wrong, but the cause may be Hadoop's block size of 64 MB, so even a small file ends up occupying that amount of space, or something like that; I couldn't quite wrap my head around some of the stuff I read on the topic.
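
If editing mapred-site.xml globally isn't an option, the same property
can usually be overridden per job on the Configuration object handed to
the driver. A minimal sketch (hypothetical class name, assumes
hadoop-common on the classpath; whether the override takes effect
depends on your cluster's settings):

	import org.apache.hadoop.conf.Configuration;

	public class ChildHeapOverride {
	    public static void main(String[] args) {
	        // Per-job override of the child task JVM heap, instead of (or in
	        // addition to) the global mapred-site.xml property shown above.
	        Configuration conf = new Configuration();
	        conf.set("mapred.child.java.opts", "-Xmx4096M");

	        // Hand this conf to whatever job/driver you launch; here we just
	        // print the value the child tasks would be started with.
	        System.out.println(conf.get("mapred.child.java.opts"));
	    }
	}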
