Posted to user@mahout.apache.org by tuxdna <tu...@gmail.com> on 2014/03/27 15:17:10 UTC
Fuzzy KMeans fails on reuters corpus with 4GB max heap size
I am running the Fuzzy KMeans algorithm on the Reuters corpus.
I am using Mahout 0.7 on Hadoop 1.1, on Ubuntu 12.04 machines.
The Hadoop cluster consists of two machines:
* master: 8GB RAM ( 4 cores )
* slave: 4GB RAM ( a KVM VM with only 1 core )
When I run this command, the clustering fails at iteration 3 ( cluster-2 ):
$ mahout fkmeans -cd 1.0 -k 21 -m 2 -ow -x 10 -dm $DISTMETRIC -i
$TFIDF_VEC -c $F_INITCLUSTERS -o $F_CLUSTERS
I see the same error in the map tasks ( on both the master and the slave ).
syslog logs:
2014-03-27 17:01:42,598 INFO org.apache.hadoop.util.NativeCodeLoader:
Loaded the native-hadoop library
2014-03-27 17:01:42,807 WARN
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
already exists!
2014-03-27 17:01:42,871 INFO org.apache.hadoop.util.ProcessTree:
setsid exited with exit code 0
2014-03-27 17:01:42,873 INFO org.apache.hadoop.mapred.Task: Using
ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4d7c07
2014-03-27 17:01:42,944 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2014-03-27 17:01:42,969 INFO org.apache.hadoop.mapred.MapTask: data
buffer = 79691776/99614720
2014-03-27 17:01:42,969 INFO org.apache.hadoop.mapred.MapTask: record
buffer = 262144/327680
2014-03-27 17:01:43,640 INFO
org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs'
truncater with mapRetainSize=-1 and reduceRetainSize=-1
2014-03-27 17:01:43,658 INFO org.apache.hadoop.io.nativeio.NativeIO:
Initialized cache for UID to User mapping with a cache timeout of
14400 seconds.
2014-03-27 17:01:43,658 INFO org.apache.hadoop.io.nativeio.NativeIO:
Got UserName hduser for UID 1002 from the native implementation
2014-03-27 17:01:43,660 FATAL org.apache.hadoop.mapred.Child: Error
running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:118)
at org.apache.mahout.math.VectorWritable.readVector(VectorWritable.java:190)
at org.apache.mahout.clustering.AbstractCluster.readFields(AbstractCluster.java:99)
at org.apache.mahout.clustering.iterator.DistanceMeasureCluster.readFields(DistanceMeasureCluster.java:55)
at org.apache.mahout.clustering.kmeans.Kluster.readFields(Kluster.java:72)
at org.apache.mahout.classifier.sgd.PolymorphicWritable.read(PolymorphicWritable.java:43)
at org.apache.mahout.clustering.iterator.ClusterWritable.readFields(ClusterWritable.java:46)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1941)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:76)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:35)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:208)
at org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:36)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
I have tried setting the maximum heap size to 4000MB on both the master and
slave machines ( in bin/hadoop ):
JAVA_HEAP_MAX=-Xmx4000m
However, I still see the same error as above.
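For what it's worth ( this is an assumption about the configuration, since the conf files aren't shown here ): on Hadoop 1.x, JAVA_HEAP_MAX in bin/hadoop only sizes the client JVM that submits the job. Map and reduce tasks run in separate child JVMs whose heap is set by mapred.child.java.opts, which defaults to -Xmx200m, so a 4000MB client heap would not reach the mapper that is throwing the OutOfMemoryError. A sketch of the relevant entry in conf/mapred-site.xml ( the 2048m value is illustrative, chosen to fit the slave's 4GB of RAM ):

```xml
<!-- conf/mapred-site.xml: JVM options for the child processes that run
     map/reduce tasks. Overrides the Hadoop 1.x default of -Xmx200m.
     The 2048m figure is an example value, not a recommendation. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```

The same setting can often be overridden per job via Hadoop's generic options ( e.g. -Dmapred.child.java.opts=-Xmx2048m immediately after the job name ), provided the driver is run through ToolRunner; whether the mahout wrapper script forwards it depends on the particular driver.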
What else could I do to avoid this problem?
Another question: can this be resolved by using a later version of Mahout?
The complete list of commands used to arrive at this error is in this
Gist:
* https://gist.github.com/tuxdna/9808278
The output of Mahout fkmeans is here:
* http://fpaste.org/89169/
And the task tracker logs are located here:
* http://fpaste.org/89166/
Thanks and regards,
Saleem
Re: Fuzzy KMeans fails on reuters corpus with 4GB max heap size
Posted by tuxdna <tu...@gmail.com>.
>
> What else could I do to avoid this problem?
>
> Another question: can this be resolved by using a later version of Mahout?
>
I ran the same example with Mahout 0.9, and it works fine for me.
Regards,
Saleem