Posted to user@mahout.apache.org by Adam Baron <ad...@gmail.com> on 2013/01/03 21:23:09 UTC

Memory Requirements of Naïve Bayes?

I'm trying to run Naïve Bayes on 2.4GB of tfidf-vectors representing a
bunch of 1-, 2-, 3-grams.  However, no matter how much I increase the
mapred.child.java.opts, I seem to get "java.lang.OutOfMemoryError: Java
heap space" errors.  My most recent attempt before emailing this mailing
list used 32GB for mapred.child.java.opts and 33GB for mapred.child.ulimit.

I'm running "mahout trainnb" with these arguments:
-i [my tfidf-vectors directory on HDFS]
-el
-o [name of a model file that does not yet exist, in an HDFS directory that
does exist]
-li [name of a label index file that does not yet exist, in an HDFS
directory that does exist]
-ow

Any idea what I can try to get this to work?  I don't think I fancy going
above 32GB for a 2.4GB input file.  Below is the output when I run the
command:

13/01/03 14:08:43 INFO common.HadoopUtil: Deleting temp
13/01/03 14:09:31 INFO input.FileInputFormat: Total input paths to process : 1
13/01/03 14:09:32 INFO mapred.JobClient: Running job: job_201211120903_15452
13/01/03 14:09:33 INFO mapred.JobClient:  map 0% reduce 0%
13/01/03 14:09:44 INFO mapred.JobClient:  map 51% reduce 0%
13/01/03 14:09:45 INFO mapred.JobClient:  map 71% reduce 0%
13/01/03 14:09:47 INFO mapred.JobClient:  map 88% reduce 0%
13/01/03 14:09:48 INFO mapred.JobClient:  map 99% reduce 0%
13/01/03 14:09:52 INFO mapred.JobClient:  map 100% reduce 0%
13/01/03 14:09:59 INFO mapred.JobClient:  map 100% reduce 5%
13/01/03 14:10:02 INFO mapred.JobClient:  map 100% reduce 31%
13/01/03 14:10:05 INFO mapred.JobClient:  map 100% reduce 33%
13/01/03 14:10:08 INFO mapred.JobClient:  map 100% reduce 75%
13/01/03 14:10:11 INFO mapred.JobClient:  map 100% reduce 78%
13/01/03 14:10:15 INFO mapred.JobClient:  map 100% reduce 82%
13/01/03 14:10:17 INFO mapred.JobClient:  map 100% reduce 89%
13/01/03 14:10:20 INFO mapred.JobClient:  map 100% reduce 95%
13/01/03 14:10:23 INFO mapred.JobClient:  map 100% reduce 100%
13/01/03 14:10:29 INFO mapred.JobClient: Job complete: job_201211120903_15452
13/01/03 14:10:29 INFO mapred.JobClient: Counters: 22
13/01/03 14:10:29 INFO mapred.JobClient:   Job Counters
13/01/03 14:10:29 INFO mapred.JobClient:     Launched reduce tasks=1
13/01/03 14:10:29 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=258302
13/01/03 14:10:29 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/01/03 14:10:29 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/01/03 14:10:29 INFO mapred.JobClient:     Launched map tasks=19
13/01/03 14:10:29 INFO mapred.JobClient:     Data-local map tasks=19
13/01/03 14:10:29 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=36375
13/01/03 14:10:29 INFO mapred.JobClient:   FileSystemCounters
13/01/03 14:10:29 INFO mapred.JobClient:     FILE_BYTES_READ=306924353
13/01/03 14:10:29 INFO mapred.JobClient:     HDFS_BYTES_READ=2545107495
13/01/03 14:10:29 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=614908308
13/01/03 14:10:29 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=217788513
13/01/03 14:10:29 INFO mapred.JobClient:   Map-Reduce Framework
13/01/03 14:10:29 INFO mapred.JobClient:     Reduce input groups=2
13/01/03 14:10:29 INFO mapred.JobClient:     Combine output records=20
13/01/03 14:10:29 INFO mapred.JobClient:     Map input records=370867
13/01/03 14:10:29 INFO mapred.JobClient:     Reduce shuffle bytes=290705921
13/01/03 14:10:29 INFO mapred.JobClient:     Reduce output records=2
13/01/03 14:10:29 INFO mapred.JobClient:     Spilled Records=40
13/01/03 14:10:29 INFO mapred.JobClient:     Map output bytes=2524521040
13/01/03 14:10:29 INFO mapred.JobClient:     Combine input records=370867
13/01/03 14:10:29 INFO mapred.JobClient:     Map output records=370867
13/01/03 14:10:29 INFO mapred.JobClient:     SPLIT_RAW_BYTES=3458
13/01/03 14:10:29 INFO mapred.JobClient:     Reduce input records=20
13/01/03 14:10:29 INFO input.FileInputFormat: Total input paths to process : 1
13/01/03 14:10:29 INFO mapred.JobClient: Running job: job_201211120903_15453
13/01/03 14:10:30 INFO mapred.JobClient:  map 0% reduce 0%
13/01/03 14:10:45 INFO mapred.JobClient:  map 50% reduce 0%
13/01/03 14:10:47 INFO mapred.JobClient:  map 100% reduce 0%
13/01/03 14:11:04 INFO mapred.JobClient:  map 100% reduce 16%
13/01/03 14:11:07 INFO mapred.JobClient:  map 100% reduce 33%
13/01/03 14:11:10 INFO mapred.JobClient:  map 100% reduce 100%
13/01/03 14:11:18 INFO mapred.JobClient: Job complete: job_201211120903_15453
13/01/03 14:11:18 INFO mapred.JobClient: Counters: 22
13/01/03 14:11:18 INFO mapred.JobClient:   Job Counters
13/01/03 14:11:18 INFO mapred.JobClient:     Launched reduce tasks=1
13/01/03 14:11:18 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=36791
13/01/03 14:11:18 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/01/03 14:11:18 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/01/03 14:11:18 INFO mapred.JobClient:     Launched map tasks=2
13/01/03 14:11:18 INFO mapred.JobClient:     Data-local map tasks=2
13/01/03 14:11:18 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=20671
13/01/03 14:11:18 INFO mapred.JobClient:   FileSystemCounters
13/01/03 14:11:18 INFO mapred.JobClient:     FILE_BYTES_READ=202961723
13/01/03 14:11:18 INFO mapred.JobClient:     HDFS_BYTES_READ=301359707
13/01/03 14:11:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=308381584
13/01/03 14:11:18 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=205579891
13/01/03 14:11:18 INFO mapred.JobClient:   Map-Reduce Framework
13/01/03 14:11:18 INFO mapred.JobClient:     Reduce input groups=2
13/01/03 14:11:18 INFO mapred.JobClient:     Combine output records=4
13/01/03 14:11:18 INFO mapred.JobClient:     Map input records=2
13/01/03 14:11:18 INFO mapred.JobClient:     Reduce shuffle bytes=7559204
13/01/03 14:11:18 INFO mapred.JobClient:     Reduce output records=2
13/01/03 14:11:18 INFO mapred.JobClient:     Spilled Records=10
13/01/03 14:11:18 INFO mapred.JobClient:     Map output bytes=217788354
13/01/03 14:11:18 INFO mapred.JobClient:     Combine input records=4
13/01/03 14:11:18 INFO mapred.JobClient:     Map output records=4
13/01/03 14:11:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=296
13/01/03 14:11:18 INFO mapred.JobClient:     Reduce input records=4
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
        at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
        at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
        at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:118)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1766)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1894)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
        at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
        at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
        at org.apache.mahout.classifier.naivebayes.BayesUtils.readModelFromDir(BayesUtils.java:61)
        at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:137)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.main(TrainNaiveBayesJob.java:62)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

Thanks,
            Adam

PS: I was able to run the classify-20newsgroups.sh example packaged with
Mahout 0.7 after only increasing mapred.child.java.opts to 2GB (it hit
similar errors at 1GB).

Re: Memory Requirements of Naïve Bayes?

Posted by Robin Anil <ro...@gmail.com>.
Use seq2encoded instead to create smaller vectors. See the other thread.
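The reason encoded vectors come out smaller: terms are hashed into a fixed
number of buckets, so the vector dimensionality is bounded no matter how many
distinct 1-/2-/3-grams the corpus contains. A minimal sketch of that general
trick (an illustration only, not Mahout's actual seq2encoded implementation):

```python
import zlib

# Hashed ("encoded") feature vectors: each term maps to one of a fixed
# number of buckets, bounding vector size regardless of vocabulary growth.
# Sketch of the general technique, not Mahout's encoder.
def hash_encode(tokens, num_buckets=1 << 18):
    vec = {}
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

v = hash_encode(["big", "data", "big"])
print(sum(v.values()))  # 3.0 -- counts preserved, in at most 2 buckets
```

The trade-off is that distinct terms can collide in a bucket, but with enough
buckets that rarely hurts classification accuracy.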


Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Thu, Jan 3, 2013 at 3:47 PM, Robin Anil <ro...@gmail.com> wrote:

> [earlier reply and original message quoted in full; trimmed]

Re: Memory Requirements of Naïve Bayes?

Posted by Robin Anil <ro...@gmail.com>.
The model is bounded by the feature space. So if you are using up to
trigrams, you need to estimate the memory needed: assume one double per
class-feature pair.

IIRC it's roughly num classes * num features * 12-16 bytes.
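To make that concrete, a quick back-of-the-envelope calculation (using the
rough 12-16 bytes per entry figure above; the feature count below is a
made-up placeholder, not the actual vocabulary size in this thread):

```python
# Rough upper bound on Naive Bayes model size: one weight per
# (class, feature) pair, at roughly 12-16 bytes each once sparse-map
# overhead is included. The feature count is a placeholder.
def estimate_model_bytes(num_classes, num_features, bytes_per_entry=16):
    return num_classes * num_features * bytes_per_entry

# e.g. 2 classes over a hypothetical 50 million 1-/2-/3-gram features:
gib = estimate_model_bytes(2, 50_000_000) / 2**30
print(f"{gib:.2f} GiB")  # about 1.49 GiB
```

With trigrams the feature count can easily run into the tens or hundreds of
millions, which is how a 2.4GB input can produce a model that overflows even
a large heap.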

See if you can actually build a model with that much memory. Otherwise, I
would suggest pruning features from the input vectors, e.g. those that
occur fewer than 5 times.
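The pruning idea can be sketched like this (a generic illustration of
dropping terms whose total corpus frequency is below a cutoff, not Mahout's
implementation; in practice Mahout's seq2sparse can do this at vectorization
time via its minimum-support option):

```python
from collections import Counter

# Generic sketch of frequency pruning: drop any term whose total count
# across the corpus is below min_freq (5, per the suggestion above).
def prune_rare_terms(docs, min_freq=5):
    totals = Counter()
    for doc in docs:
        totals.update(doc)  # Counter.update on a dict adds its counts
    keep = {t for t, c in totals.items() if c >= min_freq}
    return [{t: c for t, c in doc.items() if t in keep} for doc in docs]

docs = [{"the": 4, "rare-trigram": 1}, {"the": 3}]
print(prune_rare_terms(docs))  # "the" (total 7) kept, "rare-trigram" dropped
```

Rare n-grams dominate the feature count in a trigram vocabulary, so even a
small cutoff shrinks the feature space (and hence the model) dramatically.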

Robin

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Thu, Jan 3, 2013 at 2:23 PM, Adam Baron <ad...@gmail.com> wrote:

> [original message quoted in full; trimmed]