Posted to dev@mahout.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2009/05/27 19:33:45 UTC

[jira] Updated: (MAHOUT-122) Random Forests Reference Implementation

     [ https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deneche A. Hakim updated MAHOUT-122:
------------------------------------

    Status: Patch Available  (was: Open)

This is a basic reference implementation that builds a forest and estimates the out-of-bag error along the way. I used Mahout's Vector to represent the data instances.
There are no tests at all, I could add them but it will probably take another week, so are they worth it (I mean for a reference imp.) ?
Here is a short description of the files:
* DataUtils.java and Condition.java: contain helper methods that deal with Vector(s) and arrays of values. I don't think they'll play an important role in the distributed version (they need access to all the data at some point)
* Dataset.java: describes the attributes of the data: how many there are, which ones are NUMERICAL and which ones are NOMINAL. It is also sometimes useful to be able to IGNORE some attributes (e.g. ID attributes)
* DataLoader.java: a simple class that converts an array of comma-separated strings (the basic format of the UCI datasets) to a List of Vectors. 
  It is important to note that, to be able to store the NOMINAL attributes, which are generally of type String, this method needs to parse all the data so it can convert those values into Integers.
  To keep things simple, IGNORED attributes are replaced with Double.NaN; a more efficient implementation would skip them.
  For now this DataLoader does not know how to deal with missing attributes ('?'); an alternative would be to skip the data lines that contain missing attributes
* PostOpData.java: contains the data of the Post Operative Dataset from UCI
* InformationGain.java: entropy and Information Gain calculations. Because these calculations need access to all the data for a given tree node, these methods should probably be rewritten in a map-reduce form
* DecisionTree.java: contains a simple implementation of the LearnUnprunedTree algorithm described in the Wiki. A special case not described in the algorithm is when the data becomes empty (which happens from time to time); in this case (and based on the Weka implementation of Random Forests) the method returns a Leaf that predicts -1 (meaning it cannot predict the label).
  For now the output of the classification is the predicted label, or -1 if no label could be predicted. A better solution would be to return the label's probability distribution
* Node.java: represents a tree node; two implementations are available, for NOMINAL and NUMERICAL attributes
* ErrorEstimate.java: computes the errors for each distinct label and the global error rate, using the confusion matrix generated by the building algorithm
* RandomForest.java: uses DecisionTree to build each tree of the forest. Contains a simple example (in the main method); after execution, the program outputs the error rates for each label followed by the total error rate
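To give an idea of the kind of computation InformationGain.java performs, here is a minimal, self-contained sketch of entropy and information gain over integer labels. The class and method names are illustrative only, not the actual code in the patch:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of entropy / information gain over integer labels. */
public class InfoGainSketch {

  /** Shannon entropy (log base 2) of an array of class labels. */
  static double entropy(int[] labels) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (int label : labels) {
      counts.merge(label, 1, Integer::sum);
    }
    double h = 0.0;
    for (int count : counts.values()) {
      double p = (double) count / labels.length;
      h -= p * (Math.log(p) / Math.log(2));
    }
    return h;
  }

  /** Information gain of a binary split of 'parent' into 'left' and 'right'. */
  static double informationGain(int[] parent, int[] left, int[] right) {
    double n = parent.length;
    return entropy(parent)
        - (left.length / n) * entropy(left)
        - (right.length / n) * entropy(right);
  }

  public static void main(String[] args) {
    int[] parent = {0, 0, 1, 1};
    int[] left = {0, 0};
    int[] right = {1, 1};
    // A perfect split recovers the parent's full entropy: 1 bit here.
    System.out.println(informationGain(parent, left, right));
  }
}
```

In a distributed version, the per-class counts that feed the entropy could be computed by mappers and summed by a reducer, which is why these methods are candidates for a map-reduce rewrite.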

I started to think about the distributed implementation, but I could (if necessary) spend some time writing the tests for the reference implementation.
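For reference, the bootstrap-sampling and out-of-bag bookkeeping that forest building relies on can be sketched as below; again, every name here is illustrative and not the patch's actual API:

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative sketch of bootstrap sampling with out-of-bag tracking. */
public class BaggingSketch {

  /**
   * Draws a bootstrap sample of instance indices (with replacement).
   * Instances never picked are marked out-of-bag; they can later be
   * used to estimate the tree's error without a separate test set.
   */
  static int[] bootstrap(int numInstances, boolean[] outOfBag, Random rng) {
    Arrays.fill(outOfBag, true);
    int[] sample = new int[numInstances];
    for (int i = 0; i < numInstances; i++) {
      int picked = rng.nextInt(numInstances);
      sample[i] = picked;
      outOfBag[picked] = false; // picked at least once -> in-bag
    }
    return sample;
  }

  public static void main(String[] args) {
    int n = 10;
    boolean[] oob = new boolean[n];
    int[] sample = bootstrap(n, oob, new Random(42));
    int oobCount = 0;
    for (boolean b : oob) {
      if (b) oobCount++;
    }
    // On average roughly 1/e (~36.8%) of instances end up out-of-bag.
    System.out.println("sample size: " + sample.length + ", OOB: " + oobCount);
  }
}
```

Each tree votes only on the instances that were out-of-bag for it, and aggregating those votes against the true labels yields the confusion matrix that ErrorEstimate.java works from.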

> Random Forests Reference Implementation
> ---------------------------------------
>
>                 Key: MAHOUT-122
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-122
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> This is the first step of my GSOC project. Implement a simple, easy to understand, reference implementation of Random Forests (Building and Classification). The only requirement here is that "it works"

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Updated: (MAHOUT-122) Random Forests Reference Implementation

Posted by Ted Dunning <te...@gmail.com>.
Tests are incredibly valuable, especially if they can be applied to the next
version of the algorithm.

On Wed, May 27, 2009 at 10:33 AM, Deneche A. Hakim (JIRA)
<ji...@apache.org>wrote:

> There are no tests at all, I could add them but it will probably take
> another week, so are they worth it (I mean for a reference imp.) ?
>



-- 
Ted Dunning, CTO
DeepDyve