Posted to dev@mahout.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2009/06/12 11:46:07 UTC
[jira] Issue Comment Edited: (MAHOUT-122) Random Forests Reference
Implementation
[ https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718771#action_12718771 ]
Deneche A. Hakim edited comment on MAHOUT-122 at 6/12/09 2:44 AM:
------------------------------------------------------------------
I was wrong about the memory usage of the current implementation: even though each node has its own Data object, all the Data objects still share the same Instance objects, which hold all the actual data.
I did some profiling and found that the "InformationGain.computeSplit()" method takes nearly 98.5% of the total time; it is responsible for computing the information gain for the current split. So if we later want to optimize this implementation, we'll have to use a better algorithm to compute the information gain. The one I'm aware of, which is available in the Weka source code, precomputes the sorting indices of the data for each attribute.
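For reference, a minimal sketch of what computeSplit() has to evaluate for every candidate split (this is not the Mahout code, just the standard entropy-based information gain on hypothetical per-class counts):

```java
import java.util.Arrays;

public class InfoGainSketch {

  /** Shannon entropy (base 2) of a per-class count histogram. */
  static double entropy(int[] counts) {
    int total = Arrays.stream(counts).sum();
    if (total == 0) {
      return 0.0;
    }
    double h = 0.0;
    for (int c : counts) {
      if (c == 0) {
        continue;
      }
      double p = (double) c / total;
      h -= p * (Math.log(p) / Math.log(2));
    }
    return h;
  }

  /** Information gain of splitting `parent` into `left` and `right`. */
  static double informationGain(int[] parent, int[] left, int[] right) {
    int n = Arrays.stream(parent).sum();
    int nl = Arrays.stream(left).sum();
    int nr = Arrays.stream(right).sum();
    return entropy(parent)
        - ((double) nl / n) * entropy(left)
        - ((double) nr / n) * entropy(right);
  }

  public static void main(String[] args) {
    // Hypothetical 2-class example: 10 instances split 5/5.
    int[] parent = {5, 5};
    int[] left = {4, 1};  // mostly class 0
    int[] right = {1, 4}; // mostly class 1
    System.out.println(informationGain(parent, left, right));
  }
}
```

The cost comes from re-scanning the data to build these histograms for every candidate split; precomputed per-attribute sort indices (the Weka approach) let the counts be updated incrementally as the split point slides along the sorted values.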
I also did some memory usage profiling using a Runnable that samples, every 50 ms, a rough estimate of memory usage via (Runtime.totalMemory() - Runtime.freeMemory()). I used the KDD dataset (> 700 MB of data), then created different datasets from subsets of different sizes (1%, 10%, 25%, 50%). Here are the results:
KDD has 41 attributes (stored as "double")
KDD 1% has 49402 instances
KDD 10% has 494021 instances
KDD 25% has 1224607 instances
Dataset | Nb Trees | MUALD(*)      | Max Used Memory | Nb Nodes   | Max Tree Depth
KDD 1%  |        1 |  35.414.504 b |    38.069.640 b |        120 |             10
KDD 1%  |       10 |  35.144.096 b |    45.669.904 b | 126 (mean) |      11 (mean)
KDD 10% |        1 | 201.697.512 b |   226.653.392 b |        712 |             22
KDD 25% |        1 | 521.515.136 b |   569.795.152 b |        930 |             26
(*) Memory used right after loading the Data
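The sampler described above can be sketched roughly like this (names are hypothetical; the comment text only says it is a Runnable sampling used heap every 50 ms):

```java
/**
 * Samples a rough estimate of used heap memory every 50 ms and keeps
 * the maximum value seen. Note: totalMemory() - freeMemory() only
 * approximates live data, since it also counts garbage not yet collected.
 */
public class MemorySampler implements Runnable {

  private volatile boolean running = true;
  private volatile long maxUsed = 0;

  public void stop() {
    running = false;
  }

  public long maxUsedBytes() {
    return maxUsed;
  }

  @Override
  public void run() {
    Runtime rt = Runtime.getRuntime();
    while (running) {
      long used = rt.totalMemory() - rt.freeMemory();
      if (used > maxUsed) {
        maxUsed = used;
      }
      try {
        Thread.sleep(50); // sample every 50 ms
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }
}
```

Usage would be to start it on its own thread before loading the data, record the value after loading (the MUALD column), and stop it once tree building finishes (the Max Used Memory column).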
I should run more tests using KDD 50% and KDD 100%, and also build more trees, to see how the memory usage behaves. But because the current implementation is very slow, it may take some time.
PS: edited the results table to make it somewhat more readable
> Random Forests Reference Implementation
> ---------------------------------------
>
> Key: MAHOUT-122
> URL: https://issues.apache.org/jira/browse/MAHOUT-122
> Project: Mahout
> Issue Type: Task
> Components: Classification
> Affects Versions: 0.2
> Reporter: Deneche A. Hakim
> Attachments: 2w_patch.diff, RF reference.patch
>
> Original Estimate: 25h
> Remaining Estimate: 25h
>
> This is the first step of my GSOC project. Implement a simple, easy to understand, reference implementation of Random Forests (Building and Classification). The only requirement here is that "it works"
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.