Posted to user@mahout.apache.org by Vitaly Davydov <vd...@mirantis.com> on 2011/10/31 12:20:46 UTC

ClusterDumper issue

Hello,
We use Mahout's k-means algorithm and convert its binary output to a text
representation via ClusterDumper. Once our input reached approximately
20 million points, ClusterDumper consumed 2.5 GB of RAM and failed with an
"Out of memory" error. Our machines have no swap, and we cannot add RAM at
the moment. Is there a way to avoid this problem?

The stack trace is below:

"Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at
org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:101)
at
org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
at
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
at
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:474)
at
com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:39)
at
org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:239)
at
org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:193)
at
org.apache.mahout.utils.clustering.ClusterDumper.<init>(ClusterDumper.java:78)
at com.mirantis.bigdata.clustering.kmeans.KmeansJob.run(Unknown Source)
at com.mirantis.bigdata.clustering.kmeans.KmeansJob.main(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)"

-- 
Regards,
Vitaly Davydov

Re: ClusterDumper issue

Posted by Ted Dunning <te...@gmail.com>.
For visualization, you can also do sampling to keep no more than, say, 1000
points per cluster.  If you also remember what fraction of the points you
have kept, you can plot the points with different transparency to get a
good visual rendition of the clusters.
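
A rough sketch of the sampling idea in plain Java (the ClusterSampler class is hypothetical, not a Mahout API, and uses Java 8 idioms for brevity): it reservoir-samples up to 1000 points per cluster and records what fraction of each cluster was kept, which you can later map to per-point alpha when plotting.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Keeps at most MAX_PER_CLUSTER points per cluster via reservoir
 *  sampling and remembers what fraction of each cluster was kept. */
public class ClusterSampler {
  private static final int MAX_PER_CLUSTER = 1000;
  private final Random random = new Random();
  private final Map<Integer, List<double[]>> samples = new HashMap<>();
  private final Map<Integer, Long> totals = new HashMap<>();

  public void add(int clusterId, double[] point) {
    long seen = totals.merge(clusterId, 1L, Long::sum);
    List<double[]> reservoir =
        samples.computeIfAbsent(clusterId, k -> new ArrayList<>());
    if (reservoir.size() < MAX_PER_CLUSTER) {
      reservoir.add(point);
    } else {
      // Algorithm R: replace a random slot with probability
      // MAX_PER_CLUSTER / seen, keeping the reservoir a uniform sample.
      long j = (long) (random.nextDouble() * seen);
      if (j < MAX_PER_CLUSTER) {
        reservoir.set((int) j, point);
      }
    }
  }

  /** Fraction of the cluster's points actually kept; scale each point's
   *  alpha by 1 / keptFraction so dense clusters still look dense. */
  public double keptFraction(int clusterId) {
    long seen = totals.getOrDefault(clusterId, 0L);
    return seen == 0 ? 0.0 : Math.min(1.0, (double) MAX_PER_CLUSTER / seen);
  }
}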

On Mon, Oct 31, 2011 at 9:17 AM, Jeff Eastman <je...@narus.com> wrote:

> Unfortunately, the cluster dumper loads all the points into memory so it
> can sort them by cluster for display. What are you trying to do with the
> 20M points? Certainly not display them! A better step for subsequent
> processing would be to write a short MR program that reads the
> clusteredPoints directory and outputs each point keyed by its cluster id
> (in the mapper). Then, using k reducers, each of which will get the
> points for one cluster, each reducer will output all the points for that
> cluster.

RE: ClusterDumper issue

Posted by Jeff Eastman <je...@Narus.com>.
Unfortunately, the cluster dumper loads all the points into memory so it can sort them by cluster for display. What are you trying to do with the 20M points? Certainly not display them! A better step for subsequent processing would be to write a short MR program that reads the clusteredPoints directory and outputs each point keyed by its cluster id (in the mapper). Then, using k reducers, each of which will get the points for one cluster, each reducer will output all the points for that cluster.
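
A minimal sketch of such a job, assuming the clusteredPoints SequenceFiles are keyed by cluster id (IntWritable) with WeightedVectorWritable values, as Mahout's k-means driver writes them; the BinPointsByCluster class name and argument layout are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.mahout.clustering.WeightedVectorWritable;

/** Bins clustered points by cluster id. The default identity mapper
 *  passes (clusterId, point) records through; each reducer streams the
 *  points it receives straight to its output, so memory use stays flat
 *  no matter how large a cluster is. */
public class BinPointsByCluster {

  public static class PointReducer
      extends Reducer<IntWritable, WeightedVectorWritable, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable clusterId,
                          Iterable<WeightedVectorWritable> points,
                          Context context)
        throws IOException, InterruptedException {
      for (WeightedVectorWritable point : points) {
        // Write each point as text, one per line, keyed by its cluster.
        context.write(clusterId, new Text(point.getVector().asFormatString()));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Usage (hypothetical): BinPointsByCluster <clusteredPoints dir> <output dir> <k>
    Configuration conf = new Configuration();
    Job job = new Job(conf, "bin points by cluster");
    job.setJarByClass(BinPointsByCluster.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    // No mapper set: the default mapper is the identity.
    job.setReducerClass(PointReducer.class);
    job.setNumReduceTasks(Integer.parseInt(args[2]));
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(WeightedVectorWritable.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
    TextOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With k reduce tasks, the default hash partitioner sends all points of a given cluster to the same reducer; if the cluster ids happen to be 0..k-1, each reducer handles exactly one cluster.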
