Posted to user@mahout.apache.org by Sara Del Río García <sr...@decsai.ugr.es> on 2013/02/28 22:06:36 UTC

Partial Implementation of Random Forests

Hello all:

I'm testing the partial implementation of Random Forests on Hadoop 2.0.0-cdh4.1.1.

I'm trying to modify the algorithm; all I am doing is adding more information to 
the leaves of the tree. A leaf currently contains only the label, and I want to 
add one more field:

@Override
public void readFields(DataInput in) throws IOException {
    // Read the fields back in the same order writeNode() wrote them.
    label = in.readDouble();
    leafWeight = in.readDouble();
}

@Override
protected void writeNode(DataOutput out) throws IOException {
    // Serialize the original label plus the new leafWeight field.
    out.writeDouble(label);
    out.writeDouble(leafWeight);
}
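
For reference, here is a minimal round-trip check of this pair outside MapReduce 
(a sketch only: the two-argument Leaf constructor is an assumption based on my 
change, and Node.read() is the same entry point the stack trace below goes through):

import java.io.IOException;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.mahout.classifier.df.node.Leaf;
import org.apache.mahout.classifier.df.node.Node;

public class LeafRoundTrip {
    public static void main(String[] args) throws IOException {
        // Serialize through Node.write(), which writes a node-type marker
        // before delegating to writeNode().
        Leaf leaf = new Leaf(1.0, 0.5); // assumed (label, leafWeight) constructor
        DataOutputBuffer out = new DataOutputBuffer();
        leaf.write(out);

        // Deserialize through the same static factory the driver uses,
        // which dispatches to Leaf.readFields().
        DataInputBuffer in = new DataInputBuffer();
        in.reset(out.getData(), out.getLength());
        Node copy = Node.read(in);
        System.out.println("round trip ok: " + copy.getClass().getSimpleName());
    }
}

If this passes in isolation, the pair is symmetric, and the mismatch would be 
between what the map tasks wrote and what the driver reads back.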

And I get the following error:

13/02/27 06:53:27 INFO mapreduce.BuildForest: Partial Mapred implementation
13/02/27 06:53:27 INFO mapreduce.BuildForest: Building the forest...
13/02/27 06:53:27 INFO mapreduce.BuildForest: Weights Estimation: IR
13/02/27 06:53:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/02/27 06:53:39 INFO input.FileInputFormat: Total input paths to process : 1
13/02/27 06:53:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/02/27 06:53:39 WARN snappy.LoadSnappy: Snappy native library not loaded
13/02/27 06:53:39 INFO mapred.JobClient: Running job: job_201302270205_0013
13/02/27 06:53:40 INFO mapred.JobClient: map 0% reduce 0%
13/02/27 06:54:18 INFO mapred.JobClient: map 20% reduce 0%
13/02/27 06:54:42 INFO mapred.JobClient: map 40% reduce 0%
13/02/27 06:55:03 INFO mapred.JobClient: map 60% reduce 0%
13/02/27 06:55:26 INFO mapred.JobClient: map 70% reduce 0%
13/02/27 06:55:27 INFO mapred.JobClient: map 80% reduce 0%
13/02/27 06:55:49 INFO mapred.JobClient: map 100% reduce 0%
13/02/27 06:56:04 INFO mapred.JobClient: Job complete: job_201302270205_0013
13/02/27 06:56:04 INFO mapred.JobClient: Counters: 24
13/02/27 06:56:04 INFO mapred.JobClient: File System Counters
13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of bytes read=0
13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of bytes written=1828230
13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of read operations=0
13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of large read operations=0
13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of write operations=0
13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of bytes read=1381649
13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of bytes written=1680
13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of read operations=30
13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of write operations=10
13/02/27 06:56:04 INFO mapred.JobClient: Job Counters
13/02/27 06:56:04 INFO mapred.JobClient: Launched map tasks=10
13/02/27 06:56:04 INFO mapred.JobClient: Data-local map tasks=10
13/02/27 06:56:04 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=254707
13/02/27 06:56:04 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
13/02/27 06:56:04 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/02/27 06:56:04 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/02/27 06:56:04 INFO mapred.JobClient: Map-Reduce Framework
13/02/27 06:56:04 INFO mapred.JobClient: Map input records=20
13/02/27 06:56:04 INFO mapred.JobClient: Map output records=10
13/02/27 06:56:04 INFO mapred.JobClient: Input split bytes=1540
13/02/27 06:56:04 INFO mapred.JobClient: Spilled Records=0
13/02/27 06:56:04 INFO mapred.JobClient: CPU time spent (ms)=12070
13/02/27 06:56:04 INFO mapred.JobClient: Physical memory (bytes) snapshot=949579776
13/02/27 06:56:04 INFO mapred.JobClient: Virtual memory (bytes) snapshot=8412340224
13/02/27 06:56:04 INFO mapred.JobClient: Total committed heap usage (bytes)=478412800
Exception in thread "main" java.lang.IllegalStateException: java.io.EOFException
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:104)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
    at org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilder.processOutput(PartialBuilder.java:129)
    at org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilder.parseOutput(PartialBuilder.java:96)
    at org.apache.mahout.classifier.df.mapreduce.Builder.build(Builder.java:312)
    at org.apache.mahout.classifier.df.mapreduce.BuildForest.buildForest(BuildForest.java:246)
    at org.apache.mahout.classifier.df.mapreduce.BuildForest.run(BuildForest.java:200)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.mahout.classifier.df.mapreduce.BuildForest.main(BuildForest.java:270)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at java.io.DataInputStream.readDouble(DataInputStream.java:451)
    at org.apache.mahout.classifier.df.node.Leaf.readFields(Leaf.java:136)
    at org.apache.mahout.classifier.df.node.Node.read(Node.java:85)
    at org.apache.mahout.classifier.df.mapreduce.MapredOutput.readFields(MapredOutput.java:64)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2114)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2242)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
    ... 10 more

What is the problem?

Is it possible to write more information in the leaves of the tree?

Thank you very much.


Best regards,

Sara


Re: Partial Implementation of Random Forests

Posted by Marty Kube <ma...@gmail.com>.
Hi Sara,
On the surface your change looks okay to me, but it's hard to say, really.
It looks like the code expected to read more data than was in the stream. 
Perhaps you could add some logging around the statements that failed and try 
to get a sense of how much data, and which fields, had been read successfully 
just before the failure.
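
For example, something along these lines (a rough sketch against the snippet 
you posted) would show which field exhausts the stream:

@Override
public void readFields(DataInput in) throws IOException {
    // The last line printed before the EOFException identifies the field
    // that ran off the end of the stream; since the failure is in the
    // driver's main thread, this prints straight to the console.
    label = in.readDouble();
    System.err.println("Leaf.readFields: label=" + label);
    leafWeight = in.readDouble();
    System.err.println("Leaf.readFields: leafWeight=" + leafWeight);
}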
Did you change anything else?  Maybe you could post the diffs.
Marty


On 02/28/2013 04:06 PM, Sara Del Río García wrote:
> [original message quoted in full; snipped]