Posted to dev@mahout.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2009/06/12 11:46:07 UTC

[jira] Issue Comment Edited: (MAHOUT-122) Random Forests Reference Implementation

    [ https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718771#action_12718771 ] 

Deneche A. Hakim edited comment on MAHOUT-122 at 6/12/09 2:44 AM:
------------------------------------------------------------------

I was wrong about the memory usage of the current implementation: although each node has its own Data object, all the Data objects still share the same Instance objects, which hold all the actual data.

I did some profiling and found that the "InformationGain.computeSplit()" method, which computes the Information Gain for the current split, takes nearly 98.5% of the total time. So if we want to optimize this implementation later, we'll have to use a better algorithm to compute the Information Gain; the one I'm aware of, available in the Weka source code, precomputes the sorting indices of the data for each attribute.
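For illustration, here is a rough sketch (not the actual Mahout or Weka code; class and method names are made up) of how a pre-sorted index array lets a single linear scan evaluate every candidate split point of a numeric attribute, instead of re-partitioning the data at each split:

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch: evaluating numeric splits over a pre-sorted order. */
public class SplitSketch {

    /** Entropy (base 2) of a class-count histogram. */
    static double entropy(int[] counts, int total) {
        if (total == 0) return 0.0;
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    /**
     * Best information gain over all split points of one numeric attribute.
     * The counts are updated incrementally while scanning the sorted order,
     * so the whole attribute costs O(n) after the one-time O(n log n) sort.
     */
    static double bestGain(double[] values, int[] labels, int numClasses,
                           Integer[] sortedIdx) {
        int n = values.length;
        int[] left = new int[numClasses];
        int[] right = new int[numClasses];
        for (int lab : labels) right[lab]++;
        double base = entropy(right, n);
        double best = 0.0;
        for (int i = 0; i < n - 1; i++) {
            int idx = sortedIdx[i];
            left[labels[idx]]++;   // instance moves from right to left side
            right[labels[idx]]--;
            // only consider split points between distinct attribute values
            if (values[sortedIdx[i]] == values[sortedIdx[i + 1]]) continue;
            int nl = i + 1, nr = n - nl;
            double gain = base - (nl * entropy(left, nl) + nr * entropy(right, nr)) / n;
            if (gain > best) best = gain;
        }
        return best;
    }

    public static void main(String[] args) {
        final double[] values = {1.0, 2.0, 3.0, 4.0};
        int[] labels = {0, 0, 1, 1};
        Integer[] idx = {0, 1, 2, 3};
        // the sort is done once per attribute, then reused for every node
        Arrays.sort(idx, Comparator.comparingDouble(i -> values[i]));
        System.out.println(bestGain(values, labels, 2, idx)); // perfect split -> 1.0
    }
}
```

The key saving is that the sort indices are computed once per attribute up front, so each node only pays a linear scan.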

I also did some memory-usage profiling, using a Runnable that samples a rough estimate of used memory (Runtime.totalMemory() - Runtime.freeMemory()) every 50 ms. I used the KDD dataset (> 700 MB of data), then created smaller datasets from subsets of different sizes (1%, 10%, 25%, 50%). Here are the results:
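A sampler along these lines might look like the following (a hypothetical sketch, not the actual Runnable used for the measurements below):

```java
/**
 * Hypothetical sketch of a 50 ms memory sampler: a thread that polls
 * Runtime for a rough used-memory estimate and keeps the maximum seen.
 */
public class MemorySampler implements Runnable {
    private volatile long maxUsed = 0;
    private volatile boolean running = true;

    @Override
    public void run() {
        Runtime rt = Runtime.getRuntime();
        while (running) {
            // rough estimate: heap currently claimed minus heap still free
            long used = rt.totalMemory() - rt.freeMemory();
            if (used > maxUsed) maxUsed = used;
            try {
                Thread.sleep(50); // sample every 50 ms
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public void stop() { running = false; }

    public long maxUsed() { return maxUsed; }

    public static void main(String[] args) throws InterruptedException {
        MemorySampler sampler = new MemorySampler();
        Thread t = new Thread(sampler);
        t.setDaemon(true);
        t.start();
        byte[] ballast = new byte[10 * 1024 * 1024]; // allocate ~10 MB to observe
        Thread.sleep(200);
        sampler.stop();
        System.out.println("max used: " + sampler.maxUsed() + " b");
        if (ballast.length == 0) throw new AssertionError(); // keep ballast live
    }
}
```

Note this only measures heap claimed by the JVM, so GC timing makes the numbers approximate.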

KDD has 41 attributes (stored as "double")
KDD  1% has    49.402 instances
KDD 10% has   494.021 instances
KDD 25% has 1.224.607 instances

Dataset  | Nb Trees | MUALD(*)      | Max Used Memory | Nb Nodes   | Max Tree Depth
KDD  1%  |  1       |  35.414.504 b |  38.069.640 b   | 120        | 10
KDD  1%  | 10       |  35.144.096 b |  45.669.904 b   | 126 (mean) | 11 (mean)
KDD 10%  |  1       | 201.697.512 b | 226.653.392 b   | 712        | 22
KDD 25%  |  1       | 521.515.136 b | 569.795.152 b   | 930        | 26

(*) Memory used right after loading the Data
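As a sanity check on these numbers, the raw attribute storage alone accounts for most of the measured footprint, assuming each instance stores its 41 attributes as 8-byte primitive doubles; the remainder is presumably object and reference overhead:

```java
/** Back-of-the-envelope check of raw attribute storage for KDD 10%. */
public class MemoryEstimate {
    public static void main(String[] args) {
        long instances = 494021;   // KDD 10%, from the table above
        long attributes = 41;      // all stored as double
        long bytesPerDouble = 8;
        long raw = instances * attributes * bytesPerDouble;
        // ~162 MB of raw doubles, vs the ~201 MB measured after loading
        System.out.println(raw + " b raw");
    }
}
```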

I should run more tests using KDD 50% and KDD 100%, and also build more trees to see how the memory usage behaves. But because the current implementation is very slow, this may take some time.

PS: edited the results table to make it somewhat more readable.

> Random Forests Reference Implementation
> ---------------------------------------
>
>                 Key: MAHOUT-122
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-122
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>         Attachments: 2w_patch.diff, RF reference.patch
>
>   Original Estimate: 25h
>  Remaining Estimate: 25h
>
> This is the first step of my GSOC project. Implement a simple, easy to understand, reference implementation of Random Forests (Building and Classification). The only requirement here is that "it works"

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.