Posted to dev@mahout.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2009/06/06 13:34:08 UTC

[jira] Updated: (MAHOUT-122) Random Forests Reference Implementation

     [ https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deneche A. Hakim updated MAHOUT-122:
------------------------------------

    Attachment: 2w_patch.diff

*second week patch*
work in progress...

*changes:*
* added many tests, although some are still missing

* added a new class "Instance" that allows me to add an ID and a separate LABEL to a Vector

* DataLoader.loadData(String, FileSystem, Path) loads the data from a file, IGNORED attributes are skipped

* Dataset handles only NUMERICAL and CATEGORICAL attributes

  ** contains List<String> that represents the labels as found in the data, before being converted to int

* added a new class "Data" that represents the data being loaded

  ** contains methods to create subsets from the current Data

  ** the only way to get a new Data instance is to load it with DataLoader, or to use methods from an existing Data instance

  ** this class could prove useful later to optimize the memory usage of the data

* ForestBuilder.buildForest uses a PredictionCallback to collect the oob predictions; by changing the callback we can compute different error rates, for example:

  ** Forest out-of-bag error estimation

  ** mean tree error rate

  ** ...

* added a small running example in ForestBuilder.main(); it shows a typical use of Random Forests:

  ** loads the data from a file; you'll need to provide a descriptor (for example, UciDescriptors.java contains the descriptors for the "glass" and "post-operative" UCI datasets, both available at the UCI web site)

  ** reserves 10% of the data as a test set (not used for now)

  ** builds a random forest using the remaining data

  ** computes the oob error estimation

  ** this procedure is repeated 100 times and the mean oob error estimation is printed

If you want to try the example, download the "post-operative" dataset or the "glass" dataset from UCI, put it somewhere, change the first line of ForestBuilder.main() to the correct path, and use the corresponding UciDescriptor in the third line.
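As a rough illustration of the oob bookkeeping behind the error estimation (plain Java, self-contained; the names below are illustrative only and are not the actual patch classes): for each tree, a bootstrap sample of size N is drawn with replacement, and the instances never drawn form the out-of-bag set whose predictions the callback collects.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class OobSketch {

    // Draws a bootstrap sample of size n (with replacement) and returns the
    // out-of-bag indices, i.e. the instances never selected by the sample.
    static Set<Integer> oobIndices(int n, Random rng) {
        boolean[] inBag = new boolean[n];
        for (int i = 0; i < n; i++) {
            inBag[rng.nextInt(n)] = true; // sample one index, with replacement
        }
        Set<Integer> oob = new HashSet<>();
        for (int i = 0; i < n; i++) {
            if (!inBag[i]) oob.add(i);
        }
        return oob;
    }

    public static void main(String[] args) {
        Set<Integer> oob = oobIndices(1000, new Random(42));
        // each oob instance would be classified by the freshly built tree and
        // its prediction handed to the PredictionCallback to accumulate errors
        System.out.println("oob fraction: " + oob.size() / 1000.0);
    }
}
```

On average about 1/e (roughly 36.8%) of the instances end up out-of-bag for each tree, which is why the oob error is a reasonable substitute for a held-out test set.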

*Note about memory usage:*

* the reference implementation loads the data in-memory, then builds the trees one at a time

* each tree is built recursively using DecisionTree.learnUnprunedTree(), at each node the data is split and learnUnprunedTree() is called for each subset

* the current implementation of "Data" is not memory efficient: each subset keeps its own copy of its part of the data; thus, except for Leaf nodes, each level of the tree adds one more copy of the data in memory
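A self-contained sketch of why this grows (illustrative code, not part of the patch): if every split copies its two subsets, as the current "Data" class does, then each fully-split level of the tree duplicates the whole data set once, so a tree of depth d can hold roughly d extra copies of the data.

```java
import java.util.ArrayList;
import java.util.List;

public class CopyCost {
    static long copied = 0;

    // Recursively splits a list in half, copying each subset the way the
    // current Data implementation does, and tallies the duplicated elements.
    static void split(List<Integer> data) {
        if (data.size() <= 1) return; // leaf: nothing more to split
        int mid = data.size() / 2;
        List<Integer> left = new ArrayList<>(data.subList(0, mid));    // copy
        List<Integer> right = new ArrayList<>(data.subList(mid, data.size())); // copy
        copied += left.size() + right.size(); // one more copy of this partition
        split(left);
        split(right);
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1024; i++) data.add(i);
        split(data);
        // a balanced binary split copies all n elements once per level:
        // 1024 * log2(1024) = 10240
        System.out.println(copied);
    }
}
```

Keeping index views into a single shared copy of the data, instead of materializing each subset, would remove this per-level duplication.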

*What's next:*

* a RandomForest class that will contain the result of the forest building and can be stored to/loaded from a file

* try the implementation on the same UCI datasets as Breiman's paper, using the same complete procedure

* do some memory usage monitoring

> Random Forests Reference Implementation
> ---------------------------------------
>
>                 Key: MAHOUT-122
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-122
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>         Attachments: 2w_patch.diff, RF reference.patch
>
>   Original Estimate: 25h
>  Remaining Estimate: 25h
>
> This is the first step of my GSOC project. Implement a simple, easy to understand, reference implementation of Random Forests (Building and Classification). The only requirement here is that "it works"

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.