Posted to dev@mahout.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2010/01/14 11:15:54 UTC

[jira] Commented: (MAHOUT-245) Better handling of Categorical attributes when building Decision Forests

    [ https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800156#action_12800156 ] 

Deneche A. Hakim commented on MAHOUT-245:
-----------------------------------------

I modified the code so that a Categorical attribute is no longer selected at a node if it has already been selected in one of that node's parent nodes.
I also modified BreimanExample to show the mean number of nodes (averaged over all iterations of the example) in all the trees of the built forests.
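
The selection rule can be sketched as follows. This is only an illustration and not the code in the attached patch: the class name, method names, and the boolean-array bookkeeping are all hypothetical. It assumes each node receives a flag array recording which attributes were used for a split by its ancestors; numerical attributes stay eligible in this sketch because only Categorical attributes are excluded by the change described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not Mahout's actual implementation.
class AttributeSelectionSketch {

  /**
   * Picks up to m candidate attributes for a node split, skipping every
   * categorical attribute that an ancestor node has already split on.
   */
  static int[] randomAttributes(Random rng, boolean[] categorical,
                                boolean[] usedByAncestor, int m) {
    List<Integer> eligible = new ArrayList<>();
    for (int attr = 0; attr < categorical.length; attr++) {
      boolean exhausted = categorical[attr] && usedByAncestor[attr];
      if (!exhausted) {
        eligible.add(attr);
      }
    }
    // Shuffle the eligible attributes and keep the first m of them.
    Collections.shuffle(eligible, rng);
    int n = Math.min(m, eligible.size());
    int[] selected = new int[n];
    for (int i = 0; i < n; i++) {
      selected[i] = eligible.get(i);
    }
    return selected;
  }

  /** Returns a copy of the ancestor flags with attr marked, to hand to the child nodes. */
  static boolean[] markUsed(boolean[] usedByAncestor, int attr) {
    boolean[] copy = usedByAncestor.clone();
    copy[attr] = true;
    return copy;
  }

  public static void main(String[] args) {
    boolean[] categorical = {true, false, true};
    // Pretend the parent node split on categorical attribute 0.
    boolean[] used = markUsed(new boolean[categorical.length], 0);
    int[] picked = randomAttributes(new Random(), categorical, used, 2);
    for (int attr : picked) {
      System.out.println("candidate attribute: " + attr);
    }
  }
}
```

Copying the flags in `markUsed` rather than mutating them keeps sibling subtrees independent: an attribute used in one branch remains available in the other branches.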

I tested on two UCI datasets:
* [glass identification dataset | http://archive.ics.uci.edu/ml/datasets/Glass+Identification]: This dataset contains only numerical attributes, hence it should not be affected by the modification. The test runs 100 iterations, each building 100 trees.
* [poker hand (training) dataset | http://archive.ics.uci.edu/ml/datasets/Poker+Hand]: This dataset contains 10 categorical attributes. The test runs 10 iterations, each building 100 trees.

Results before the modification:

|| Dataset || Selection || Single Input || One Tree || Mean RI Time || Mean SI Time || Mean RI num nodes || Mean SI num nodes ||
| glass | 25.2% | 25.6% | 40.1% | 1s 27ms | 0s 497ms | 6715 | 11419 |
| poker | 27.5% | 37.8% | 44.2% | 1m 14s 855ms | 58s 200ms | 1442811 | 2133194 |

Results after the modification:

|| Dataset || Selection || Single Input || One Tree || Mean RI Time || Mean SI Time || Mean RI num nodes || Mean SI num nodes ||
| glass | 22.5% | 22.8% | 39.8% | 0s 935ms | 0s 442ms | 6735 | 11528 |
| poker | 27.8% | 38.0% | 42.9% | 53s 24ms | 36s 818ms | 1372914 | 1700049 |

The Breiman Example and the meaning of the columns are described [here | http://issues.apache.org/jira/browse/MAHOUT-122?focusedCommentId=12718777&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12718777]. The two new columns are:
* Mean RI num nodes: mean number of nodes in the forest built using Random Selection
* Mean SI num nodes: mean number of nodes in the forest built using Single-Input Selection

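The variations in the error rates are due (I hope) to the randomness in the process. The build times are only indicative, since I'm running Ubuntu inside VirtualBox. The modification effectively reduces the number of nodes for the "poker" dataset (by about 4.8% with Random Selection and about 20.3% with Single-Input Selection), while the node counts for "glass" change by 1% or less, as expected for a dataset with only numerical attributes.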


> Better handling of Categorical attributes when building Decision Forests
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-245
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-245
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>             Fix For: 0.3
>
>         Attachments: mahout-245.patch
>
>
> When building a decision tree, at each node a random subset from all variables (attributes) is considered for the node split.
> If a Categorical variable has been selected, the data available at the node is split such that each child node has the same value for the selected variable. In all sub-nodes the selected variable should not be selected again, but the current implementation does not account for that. The resulting tree may contain redundant nodes that do not impair its classification performance but are nonetheless unnecessary.
