Posted to dev@mahout.apache.org by "Ikumasa Mukai (Updated) (JIRA)" <ji...@apache.org> on 2011/12/08 18:37:41 UTC

[jira] [Updated] (MAHOUT-840) Decision Forests should support Regression problems

     [ https://issues.apache.org/jira/browse/MAHOUT-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ikumasa Mukai updated MAHOUT-840:
---------------------------------

    Attachment: DecisionTreeBuilderTest.java
                MAHOUT-840.patch

Hi Hakim-san.

Sorry for the delay!
I have added a new patch (MAHOUT-840.patch), which contains the new TreeBuilder and more.

The additions are:

1) Added DecisionTreeBuilder

This class can be used to build both classification and regression trees.
When building a regression tree, it uses the variance.
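
As a rough illustration of how variance can drive a regression split (a Python sketch, not the code in the patch): one common approach, assumed here, is to pick the threshold that minimises the size-weighted variance of the target values on the two sides of the split. The names `weighted_variance` and `best_split` are hypothetical.

```python
from statistics import pvariance


def weighted_variance(left, right):
    """Size-weighted average of the population variances of two target lists."""
    n = len(left) + len(right)
    return (len(left) * pvariance(left) + len(right) * pvariance(right)) / n


def best_split(values, targets):
    """Return (threshold, score) minimising the weighted target variance.

    Candidate thresholds are midpoints between consecutive distinct attribute
    values; rows with value < threshold go left, the rest go right.
    Returns None when all attribute values are identical.
    """
    pairs = sorted(zip(values, targets))
    best = None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal attribute values
        threshold = (lo + hi) / 2
        left = [t for v, t in pairs if v < threshold]
        right = [t for v, t in pairs if v >= threshold]
        score = weighted_variance(left, right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best
```

The `<` / `>=` convention matches the conditions shown in the TreePrinter output later in this message; the midpoint rule for candidate thresholds is an assumption.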

This class also has functions for filling in missing leaves and for preventing overfitting, for both kinds of tree.

To fill in a missing leaf, the other leaves of the parent stem are used.
To prevent overfitting, the number of data points on the leaf is used (for regression trees, the variance is also checked).
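
The overfitting check described above could look roughly like the following Python sketch. It is only a guess at the shape of the rule: the parameter names `min_leaf_size` and `min_variance`, their default values, and the direction of the comparisons are all assumptions, not taken from the patch.

```python
from statistics import pvariance


def should_stop(targets, regression, min_leaf_size=2, min_variance=1e-3):
    """Decide whether a node should become a leaf instead of being split.

    targets: target values (numbers for regression, labels for classification)
    regression: True for a regression tree, False for a classification tree
    min_leaf_size, min_variance: hypothetical thresholds (assumed defaults)
    """
    # Too few data points to justify another split.
    if len(targets) < max(min_leaf_size, 1):
        return True
    # For regression trees, also stop when the targets barely vary.
    if regression and pvariance(targets) < min_variance:
        return True
    return False
```

For example, `should_stop([1.0, 1.0, 1.0], regression=True)` stops because the variance is zero, while `should_stop([1.0, 5.0, 9.0], regression=True)` allows a further split.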

2) Added RegressionResultAnalyzer

This class prints a summary of the results like this:

{noformat}
=======================================================
Summary
-------------------------------------------------------
Correlation coefficient                 :     1.0076
Mean absolute error                     :     1.8083
Root mean squared error                 :     2.5944
Total Regressed Instances               :         50
{noformat}
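
For reference, the three statistics in such a summary can be computed as in the Python sketch below. It uses the standard textbook definitions (Pearson correlation, mean absolute error, root mean squared error); whether RegressionResultAnalyzer uses exactly these formulas is an assumption, and the function name `regression_summary` is hypothetical.

```python
import math


def regression_summary(actual, predicted):
    """Return correlation, MAE, RMSE and instance count for paired values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    # Sums of centred products, as used in the Pearson correlation coefficient.
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    denom = math.sqrt(var_a * var_p)
    correlation = cov / denom if denom > 0 else float("nan")
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    return {
        "correlation": correlation,
        "mean_absolute_error": mae,
        "root_mean_squared_error": rmse,
        "instances": n,
    }
```

With this definition the correlation coefficient always lies between -1 and 1, up to floating-point rounding.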

3) How to use:
I added a "-b" parameter to BuildForest for selecting the TreeBuilder class.

{noformat}
 org.apache.mahout.df.mapreduce.BuildForest \
-Dmapred.max.split.size=1874231 \
-oob \
-d $KDD_DATA/KDDTrain+.arff \
-ds $KDD_DATA/KDDTrain+.info \
-sl 5 \
-p \
-t 100 \
-b org.apache.mahout.classifier.df.builder.DecisionTreeBuilder \
-o $KDD_DATA/model
{noformat}

For both classification and regression, I tested this patch by visual inspection using DecisionTreeBuilderTest.java.
This class uses the TreePrinter and the ArffDataLoader.

The TreePrinter can be used to make the model data human-readable, like this:

i. iris - classification
{noformat}petallength < 3.3 : Iris-setosa
petallength >= 3.3
|   petalwidth < 1.8
|   |   petallength < 5
|   |   |   petalwidth < 1.7 : Iris-versicolor
|   |   |   petalwidth >= 1.7 : Iris-virginica
|   |   petallength >= 5
|   |   |   petalwidth < 1.6 : Iris-virginica
|   |   |   petalwidth >= 1.6
|   |   |   |   sepallength < 7.2 : Iris-versicolor
|   |   |   |   sepallength >= 7.2 : Iris-virginica
|   petalwidth >= 1.8
|   |   petallength < 4.9
|   |   |   sepallength < 6 : Iris-versicolor
|   |   |   sepallength >= 6 : Iris-virginica
|   |   petallength >= 4.9 : Iris-virginica
{noformat} 

ii. cars - regression
{noformat}speed < 30
|   speed < 12
|   |   speed < 3 : 4
|   |   speed >= 3
|   |   |   speed < 7 : 7
|   |   |   speed >= 7 : 6.5
|   speed >= 12
|   |   speed < 23
|   |   |   speed < 21
|   |   |   |   speed < 19
|   |   |   |   |   speed < 15 : 12
|   |   |   |   |   speed >= 15
|   |   |   |   |   |   speed < 16.5 : 8
|   |   |   |   |   |   speed >= 16.5
|   |   |   |   |   |   |   speed < 17.5 : 11
|   |   |   |   |   |   |   speed >= 17.5 : 10
|   |   |   |   speed >= 19 : 13.5
|   |   |   speed >= 21 : 7
|   |   speed >= 23
|   |   |   speed < 27
|   |   |   |   speed < 25 : 12
|   |   |   |   speed >= 25 : 13
|   |   |   speed >= 27 : 11.5
speed >= 30
|   speed < 84.5
---snip---
{noformat}
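
The indented layout above can be reproduced with a short recursive printer. The Python sketch below is not the TreePrinter from the patch; the `Leaf` and `Split` classes and the `render` function are hypothetical stand-ins for a binary tree with numeric thresholds, and the small tree in the usage note is illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    label: object  # class name or predicted numeric value


@dataclass
class Split:
    attribute: str
    threshold: float
    low: object   # subtree for attribute < threshold (Leaf or Split)
    high: object  # subtree for attribute >= threshold (Leaf or Split)


def render(node, depth=0):
    """Return the lines describing a Split node, one branch per line."""
    lines = []
    for op, child in (("<", node.low), (">=", node.high)):
        head = "|   " * depth + f"{node.attribute} {op} {node.threshold:g}"
        if isinstance(child, Leaf):
            # Leaves are printed on the same line as their condition.
            lines.append(f"{head} : {child.label}")
        else:
            lines.append(head)
            lines.extend(render(child, depth + 1))
    return lines
```

For instance, `print("\n".join(render(Split("petallength", 3.3, Leaf("Iris-setosa"), Split("petalwidth", 1.8, Leaf("Iris-versicolor"), Leaf("Iris-virginica"))))))` prints a four-line tree in the same style.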

The ArffDataLoader can read ARFF-format data files and makes testing easier.
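
To show what reading an ARFF file involves, here is a minimal Python sketch. It is not the ArffDataLoader from the patch: the function name `load_arff` is hypothetical, and it deliberately handles only a subset of ARFF (`%` comments, `@attribute` and `@data` lines, unquoted comma-separated rows, `?` for missing values).

```python
NUMERIC_TYPES = ("numeric", "real", "integer")


def load_arff(text):
    """Parse a simple ARFF document into (attributes, rows).

    attributes: list of (name, type) pairs in declaration order
    rows: list of lists; numeric cells become floats, "?" becomes None,
          everything else is kept as a string
    """
    attributes, rows = [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                _, name, kind = line.split(None, 2)
                attributes.append((name, kind.strip()))
            elif lower.startswith("@data"):
                in_data = True
            # Other header lines such as @relation are ignored in this sketch.
            continue
        row = []
        for (name, kind), cell in zip(attributes, line.split(",")):
            cell = cell.strip()
            if cell == "?":
                row.append(None)
            elif kind.lower() in NUMERIC_TYPES:
                row.append(float(cell))
            else:
                row.append(cell)
        rows.append(row)
    return attributes, rows
```

Quoted values, sparse rows, and date or string attributes are left out to keep the sketch short.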

These two additions are included in the last regression.patch.

Regards,
                
> Decision Forests should support Regression problems
> ---------------------------------------------------
>
>                 Key: MAHOUT-840
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-840
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>         Attachments: DecisionTreeBuilderTest.java, MAHOUT-840.patch, regression.patch, regression.patch, regression.patch
>
>
> Improve Decision Forest code in order to handle numerical targets, thus supporting regression problems
