Posted to dev@mahout.apache.org by "Sara Del Río García (JIRA)" <ji...@apache.org> on 2013/02/28 21:39:12 UTC

[jira] [Commented] (MAHOUT-145) PartialData mapreduce Random Forests

    [ https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589902#comment-13589902 ] 

Sara Del Río García commented on MAHOUT-145:
--------------------------------------------

Hello Deneche A. Hakim:

I'm testing the partial implementation of Random Forests on Hadoop 2.0.0-cdh4.1.1.

I'm trying to modify the algorithm; all I do is add more information to the leaves of the tree. Currently a leaf contains only the label, and I want to add one more value:

  @Override
  public void readFields(DataInput in) throws IOException {
    label = in.readDouble();
    leafWeight = in.readDouble(); // new field; must mirror writeNode below
  }

  @Override
  protected void writeNode(DataOutput out) throws IOException {
    out.writeDouble(label);
    out.writeDouble(leafWeight); // new field; must mirror readFields above
  }
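(Editor's note, a hedged aside: an EOFException from DataInputStream.readDouble inside readFields usually means some serialized leaf in the sequence file carries fewer bytes than readFields now expects, for example because the map tasks ran with an unmodified jar on the cluster while the client-side reader uses the new code, or because another write path still emits only the label. The class and field names below are hypothetical stand-ins, not Mahout's; this minimal sketch only demonstrates the symmetric read/write contract and how the EOF arises:)

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class LeafRoundTrip {

    // Writes a leaf the "new" way: label followed by leafWeight.
    static byte[] writeLeaf(double label, double leafWeight) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeDouble(label);
        out.writeDouble(leafWeight);
        out.flush();
        return bytes.toByteArray();
    }

    // Reads a leaf the "new" way: expects exactly two doubles in the stream.
    static double[] readLeaf(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        double label = in.readDouble();
        // Throws EOFException if the writer emitted only the label.
        double leafWeight = in.readDouble();
        return new double[] { label, leafWeight };
    }

    public static void main(String[] args) throws IOException {
        // Symmetric write/read round-trips cleanly.
        double[] ok = readLeaf(writeLeaf(1.0, 0.5));
        System.out.println("label=" + ok[0] + " leafWeight=" + ok[1]);

        // An "old" writer that emits only the label produces a shorter
        // stream, and the new readFields fails exactly like the stack
        // trace below: readDouble -> readLong -> readFully -> EOFException.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeDouble(1.0);
        try {
            readLeaf(bytes.toByteArray());
        } catch (EOFException e) {
            System.out.println("EOFException: stream is shorter than readFields expects");
        }
    }
}
```

In other words, the modified readFields and writeNode shown above are symmetric with each other; the thing to check is that every path that serializes a leaf (and every jar actually deployed on the cluster) writes the extra double too.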


And I get the following error:

13/02/27 06:53:27 INFO mapreduce.BuildForest: Partial Mapred implementation
13/02/27 06:53:27 INFO mapreduce.BuildForest: Building the forest...
13/02/27 06:53:27 INFO mapreduce.BuildForest: Weights Estimation: IR
13/02/27 06:53:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/02/27 06:53:39 INFO input.FileInputFormat: Total input paths to process : 1
13/02/27 06:53:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/02/27 06:53:39 WARN snappy.LoadSnappy: Snappy native library not loaded
13/02/27 06:53:39 INFO mapred.JobClient: Running job: job_201302270205_0013
13/02/27 06:53:40 INFO mapred.JobClient:  map 0% reduce 0%
13/02/27 06:54:18 INFO mapred.JobClient:  map 20% reduce 0%
13/02/27 06:54:42 INFO mapred.JobClient:  map 40% reduce 0%
13/02/27 06:55:03 INFO mapred.JobClient:  map 60% reduce 0%
13/02/27 06:55:26 INFO mapred.JobClient:  map 70% reduce 0%
13/02/27 06:55:27 INFO mapred.JobClient:  map 80% reduce 0%
13/02/27 06:55:49 INFO mapred.JobClient:  map 100% reduce 0%
13/02/27 06:56:04 INFO mapred.JobClient: Job complete: job_201302270205_0013
13/02/27 06:56:04 INFO mapred.JobClient: Counters: 24
13/02/27 06:56:04 INFO mapred.JobClient:   File System Counters
13/02/27 06:56:04 INFO mapred.JobClient:     FILE: Number of bytes read=0
13/02/27 06:56:04 INFO mapred.JobClient:     FILE: Number of bytes written=1828230
13/02/27 06:56:04 INFO mapred.JobClient:     FILE: Number of read operations=0
13/02/27 06:56:04 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/02/27 06:56:04 INFO mapred.JobClient:     FILE: Number of write operations=0
13/02/27 06:56:04 INFO mapred.JobClient:     HDFS: Number of bytes read=1381649
13/02/27 06:56:04 INFO mapred.JobClient:     HDFS: Number of bytes written=1680
13/02/27 06:56:04 INFO mapred.JobClient:     HDFS: Number of read operations=30
13/02/27 06:56:04 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/02/27 06:56:04 INFO mapred.JobClient:     HDFS: Number of write operations=10
13/02/27 06:56:04 INFO mapred.JobClient:   Job Counters 
13/02/27 06:56:04 INFO mapred.JobClient:     Launched map tasks=10
13/02/27 06:56:04 INFO mapred.JobClient:     Data-local map tasks=10
13/02/27 06:56:04 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=254707
13/02/27 06:56:04 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
13/02/27 06:56:04 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/02/27 06:56:04 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/02/27 06:56:04 INFO mapred.JobClient:   Map-Reduce Framework
13/02/27 06:56:04 INFO mapred.JobClient:     Map input records=20
13/02/27 06:56:04 INFO mapred.JobClient:     Map output records=10
13/02/27 06:56:04 INFO mapred.JobClient:     Input split bytes=1540
13/02/27 06:56:04 INFO mapred.JobClient:     Spilled Records=0
13/02/27 06:56:04 INFO mapred.JobClient:     CPU time spent (ms)=12070
13/02/27 06:56:04 INFO mapred.JobClient:     Physical memory (bytes) snapshot=949579776
13/02/27 06:56:04 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=8412340224
13/02/27 06:56:04 INFO mapred.JobClient:     Total committed heap usage (bytes)=478412800
READ 
nodetype: 0
Exception in thread "main" java.lang.IllegalStateException: java.io.EOFException
	at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:104)
	at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
	at org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilder.processOutput(PartialBuilder.java:129)
	at org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilder.parseOutput(PartialBuilder.java:96)
	at org.apache.mahout.classifier.df.mapreduce.Builder.build(Builder.java:312)
	at org.apache.mahout.classifier.df.mapreduce.BuildForest.buildForest(BuildForest.java:246)
	at org.apache.mahout.classifier.df.mapreduce.BuildForest.run(BuildForest.java:200)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.mahout.classifier.df.mapreduce.BuildForest.main(BuildForest.java:270)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at java.io.DataInputStream.readLong(DataInputStream.java:399)
	at java.io.DataInputStream.readDouble(DataInputStream.java:451)
	at org.apache.mahout.classifier.df.node.Leaf.readFields(Leaf.java:136)
	at org.apache.mahout.classifier.df.node.Node.read(Node.java:85)
	at org.apache.mahout.classifier.df.mapreduce.MapredOutput.readFields(MapredOutput.java:64)
	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2114)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2242)
	at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
	... 10 more

What is the problem?
Could you try writing something extra in the leaves of the tree? Anything at all.

Thank you very much.

Best regards,

Sara
                
> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>            Priority: Minor
>             Fix For: 0.2
>
>         Attachments: partial_August_10.patch, partial_August_13.patch, partial_August_15.patch, partial_August_17.patch, partial_August_19.patch, partial_August_24.patch, partial_August_27.patch, partial_August_2.patch, partial_August_31.patch, partial_August_9.patch, partial_Sep_15.patch, partial_Sep_30.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions of the data. That loses some of the solidity of the original method, but could actually do better if the splits exposed non-stationary behavior."

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira