Posted to user@mahout.apache.org by beneo_7 <be...@163.com> on 2010/11/23 09:07:13 UTC

canopy map 100% reduce 100%, then memory crazy increase

I am using the Mahout 0.4 release.

In mahout-distribution-0.4/bin, I ran:
./mahout canopy -i /home/space/lucene_clustering/vector/vector -o /home/space/lucene_clustering/canopy/ -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -t1 0.8 -t2 0.2 -ow
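
(For reference, T1 is the loose threshold and T2 the tight one, so -t1 must be larger than -t2, which holds here. To inspect the canopies the job writes, something like the following should work; the clusters-0 subdirectory name and the -s/--seqFile flag are what I recall from 0.4, so check seqdumper's help output for your version:)

# Sketch: dump the canopy centroids written by the job above. The
# clusters-0 path and the part file name are assumptions; on a real
# cluster the part file may be named differently.
./mahout seqdumper -s /home/space/lucene_clustering/canopy/clusters-0/part-r-00000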

In hadoop-env.sh, I added:
export HADOOP_HEAPSIZE=20000
export HADOOP_OPTS="-Xmn3g -Xss128k -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=31 -XX:+AggressiveOpts -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9004 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
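
(One thing worth noting: the log below shows mapred.LocalJobRunner, so the whole job runs inside the client JVM and HADOOP_HEAPSIZE/HADOOP_OPTS really do govern it. On a real cluster the task JVMs are separate processes that take their heap from mapred.child.java.opts instead; a sketch, assuming the driver passes generic -D options through ToolRunner as the stack trace suggests:)

# Distributed-mode equivalent: give each task JVM its own heap limit.
# The -Xmx4g value is illustrative, not a recommendation; the -D
# pass-through is an assumption about the 0.4 driver.
./mahout canopy -Dmapred.child.java.opts=-Xmx4g \
    -i /home/space/lucene_clustering/vector/vector \
    -o /home/space/lucene_clustering/canopy/ \
    -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
    -t1 0.8 -t2 0.2 -ow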

I am sure all the parameters took effect, because I used JConsole to verify the VM parameters.
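
(Besides JConsole, a live heap histogram shows which object types are actually filling the heap; jps and jmap ship with the JDK:)

jps -l                              # find the pid of the hadoop/mahout client JVM
jmap -histo:live <pid> | head -20   # top object classes by instance count and bytes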

However, after map 100% and reduce 100%, memory usage climbs from 2.5 GB to 20 GB and the exception below is thrown. The input file of vectors is 30 MB with 50,000 records.

10/11/23 16:04:27 INFO mapred.LocalJobRunner: reduce > reduce
10/11/23 16:04:27 INFO mapred.JobClient:  map 100% reduce 100%
10/11/23 16:04:30 INFO mapred.LocalJobRunner: reduce > reduce
10/11/23 16:08:17 WARN mapred.LocalJobRunner: job_local_0001
java.lang.OutOfMemoryError: Java heap space
    at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
    at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
    at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:134)
    at org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:449)
    at org.apache.mahout.clustering.AbstractCluster.computeParameters(AbstractCluster.java:184)
    at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:42)
    at org.apache.mahout.clustering.canopy.CanopyReducer.reduce(CanopyReducer.java:29)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
10/11/23 16:08:18 INFO mapred.JobClient: Job complete: job_local_0001
10/11/23 16:08:18 INFO mapred.JobClient: Counters: 12
10/11/23 16:08:18 INFO mapred.JobClient:   FileSystemCounters
10/11/23 16:08:18 INFO mapred.JobClient:     FILE_BYTES_READ=70413991
10/11/23 16:08:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=164338288
10/11/23 16:08:18 INFO mapred.JobClient:   Map-Reduce Framework
10/11/23 16:08:18 INFO mapred.JobClient:     Reduce input groups=1
10/11/23 16:08:18 INFO mapred.JobClient:     Combine output records=0
10/11/23 16:08:18 INFO mapred.JobClient:     Map input records=50000
10/11/23 16:08:18 INFO mapred.JobClient:     Reduce shuffle bytes=0
10/11/23 16:08:18 INFO mapred.JobClient:     Reduce output records=227
10/11/23 16:08:18 INFO mapred.JobClient:     Spilled Records=64708
10/11/23 16:08:18 INFO mapred.JobClient:     Map output bytes=8836211
10/11/23 16:08:18 INFO mapred.JobClient:     Combine input records=0
10/11/23 16:08:18 INFO mapred.JobClient:     Map output records=32354
10/11/23 16:08:18 INFO mapred.JobClient:     Reduce input records=32354
Exception in thread "main" java.lang.InterruptedException: Canopy Job failed processing /home/space/lucene_clustering/vector/vector
    at org.apache.mahout.clustering.canopy.CanopyDriver.buildClustersMR(CanopyDriver.java:252)
    at org.apache.mahout.clustering.canopy.CanopyDriver.buildClusters(CanopyDriver.java:167)
    at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:114)
    at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:91)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

RE: canopy map 100% reduce 100%, then memory crazy increase

Posted by Jure Jeseničnik <Ju...@planet9.si>.
I probably had the same problem. 
See this thread: 
http://mail-archives.apache.org/mod_mbox/mahout-user/201011.mbox/%3C0EDE11E319B0B043B4F24E0305CABF7C80413134A4@P9MAIL.p9.internal%3E

The short answer is that this was a bug in the current release, which Jeff fixed this weekend. I tested yesterday and can confirm that it works now.
Just check out the latest version and use that instead of the bundled release. It should work.

Here are the build instructions:
https://cwiki.apache.org/MAHOUT/buildingmahout.html
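
(In concrete terms, the checkout and build come down to something like this; the -DskipTests flag is just to save time and is my addition, not from the wiki page:)

# Check out current trunk from the Apache Subversion repository and build it.
svn co http://svn.apache.org/repos/asf/mahout/trunk mahout-trunk
cd mahout-trunk
mvn clean install -DskipTests   # drop -DskipTests for a full build with tests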

Best regards,

Jure



-----Original Message-----
From: beneo_7 [mailto:beneo_7@163.com] 
Sent: Tuesday, November 23, 2010 9:07 AM
To: user@mahout.apache.org
Subject: canopy map 100% reduce 100%, then memory crazy increase
