Posted to dev@mahout.apache.org by "Ikumasa Mukai (Created) (JIRA)" <ji...@apache.org> on 2012/01/05 18:01:39 UTC

[jira] [Created] (MAHOUT-943) Improve the way to make the split point on DF.

Improve the way to make the split point on DF.
----------------------------------------------

                 Key: MAHOUT-943
                 URL: https://issues.apache.org/jira/browse/MAHOUT-943
             Project: Mahout
          Issue Type: Improvement
          Components: Classification
            Reporter: Ikumasa Mukai


numericalSplit() in OptIgSplit uses the attribute value with the best IG as the split point.

But I think this is a little too strict. In some situations it would be better to use the average of the best-IG value and the 2nd value.
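
To illustrate the proposal, here is a minimal Java sketch. It is not Mahout code: the class AveragedSplitPoint, its method, and the parallel arrays of candidate values and IG scores are all hypothetical, and it assumes that "the 2nd value" means the candidate with the second-highest IG.

```java
public class AveragedSplitPoint {

  /**
   * Hypothetical helper: values[i] is a candidate attribute value and ig[i]
   * is the information gain obtained when splitting at that value.
   * Returns the average of the best-IG and second-best-IG candidates.
   */
  public static double averagedSplit(double[] values, double[] ig) {
    if (values.length == 0 || values.length != ig.length) {
      throw new IllegalArgumentException("need matching, non-empty arrays");
    }
    int best = -1;
    int second = -1;
    for (int i = 0; i < ig.length; i++) {
      if (best < 0 || ig[i] > ig[best]) {
        second = best; // the old best becomes the runner-up
        best = i;
      } else if (second < 0 || ig[i] > ig[second]) {
        second = i;
      }
    }
    if (second < 0) {
      return values[best]; // only one candidate, nothing to average
    }
    return (values[best] + values[second]) / 2.0;
  }

  public static void main(String[] args) {
    double[] values = {1.0, 2.0, 3.0, 4.0};
    double[] ig = {0.1, 0.5, 0.4, 0.2};
    // Current behaviour would pick 2.0; the averaged variant lands between 2.0 and 3.0.
    System.out.println(averagedSplit(values, ig));
  }
}
```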

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-943) Improve the way to make the split point on DF.

Posted by "Ikumasa Mukai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181295#comment-13181295 ] 

Ikumasa Mukai commented on MAHOUT-943:
--------------------------------------

Thank you for your advice.
Yes, I have looked into how to implement this, and I agree with you that BuildForest should get an option for switching the implementation.

I will post a patch here when it is done!
                
> Improve the way to make the split point on DF.
> ----------------------------------------------
>
>                 Key: MAHOUT-943
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-943
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Ikumasa Mukai
>              Labels: DecisionForest
>
> numericalSplit() in OptIgSplit uses the attribute value with the best IG as the split point.
> But I think this is a little too strict. In some situations it would be better to use the average of the best-IG value and the 2nd value.


[jira] [Updated] (MAHOUT-943) Improve the way to make the split point on DF.

Posted by "Ikumasa Mukai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ikumasa Mukai updated MAHOUT-943:
---------------------------------

    Status: Patch Available  (was: Open)
    

[jira] [Commented] (MAHOUT-943) Improve the way to make the split point on DF.

Posted by "Deneche A. Hakim (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180569#comment-13180569 ] 

Deneche A. Hakim commented on MAHOUT-943:
-----------------------------------------

You can inherit from IgSplit and provide your own implementation. But we'll also need a way to tell BuildForest which implementation to use.
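
A rough sketch of that idea in plain Java follows. SplitStrategy is a hypothetical stand-in for IgSplit (the real class has a different signature), and the create() helper that picks an implementation by class name is only one possible way an option on BuildForest could work.

```java
public class SplitSelectionSketch {

  /** Hypothetical stand-in for IgSplit; not the real Mahout signature. */
  public abstract static class SplitStrategy {
    /** Chooses a split point from the best candidate and the runner-up. */
    public abstract double splitPoint(double bestValue, double secondValue);
  }

  /** Keeps the current behaviour: the best candidate is the split point. */
  public static class BestValueSplit extends SplitStrategy {
    @Override
    public double splitPoint(double bestValue, double secondValue) {
      return bestValue;
    }
  }

  /** Alternative behaviour: average the best candidate and the runner-up. */
  public static class AveragedSplit extends SplitStrategy {
    @Override
    public double splitPoint(double bestValue, double secondValue) {
      return (bestValue + secondValue) / 2.0;
    }
  }

  /** Instantiates the strategy named by className, e.g. taken from an option. */
  public static SplitStrategy create(String className) {
    try {
      return Class.forName(className)
          .asSubclass(SplitStrategy.class)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("cannot create " + className, e);
    }
  }

  public static void main(String[] args) {
    SplitStrategy strategy = create(AveragedSplit.class.getName());
    System.out.println(strategy.splitPoint(2.0, 3.0));
  }
}
```

Selecting by class name keeps the forest-building code unaware of the concrete splitter, at the cost of failing only at run time when the name is wrong.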
                

[jira] [Commented] (MAHOUT-943) Improve the way to make the split point on DF.

Posted by "Wang Yue (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184004#comment-13184004 ] 

Wang Yue commented on MAHOUT-943:
---------------------------------

Hi, Mukai
  Thanks for your efforts to extend the RF to regression. However, I have a doubt about your implementation in Regressionsplit.java. The variance method is:
"
  private static double variance(double[] s, double[] ss, double[] dataSize) {
    double var = 0;
    for (int i = 0; i < s.length; i++) {
      if (dataSize[i] > 0) {
        var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
      }
    }
    return var;
  }
"

In my mind, the variance should be something like
var += ss[i] / dataSize[i] - ((s[i] * s[i]) / (dataSize[i] * dataSize[i]));

Please correct me if I am wrong. Thanks
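
For reference, here is a small Java check of the two expressions on a single group of values. The class and method names are made up for illustration; s is the sum of the values, ss the sum of their squares, and n the count. The expression in the patch equals n times the variance (the sum of squared deviations from the mean), while the second expression is the variance itself.

```java
public class VarianceCheck {

  /** Expression used in the patch: sum of squared deviations from the mean. */
  public static double sumOfSquaredDeviations(double s, double ss, double n) {
    return ss - (s * s) / n;
  }

  /** Population variance: mean of the squares minus the square of the mean. */
  public static double variance(double s, double ss, double n) {
    return ss / n - (s * s) / (n * n);
  }

  public static void main(String[] args) {
    double[] x = {2, 4, 4, 4, 5, 5, 7, 9};
    double s = 0;
    double ss = 0;
    for (double v : x) {
      s += v;      // running sum
      ss += v * v; // running sum of squares
    }
    double n = x.length;
    System.out.println("patch expression: " + sumOfSquaredDeviations(s, ss, n));
    System.out.println("variance:         " + variance(s, ss, n));
  }
}
```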
                

[jira] [Commented] (MAHOUT-943) Improve the way to make the split point on DF.

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184024#comment-13184024 ] 

Ted Dunning commented on MAHOUT-943:
------------------------------------

Also, that isn't a particularly good way to compute variance in the first place.

Better to use Welford's method.  Better still, use something like the OnlineSummarizer.
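
For readers unfamiliar with it, Welford's method updates the mean and the sum of squared deviations one value at a time, which avoids subtracting two large, nearly equal sums. The following is a minimal standalone sketch, not the OnlineSummarizer implementation; the class name WelfordVariance is hypothetical.

```java
public class WelfordVariance {
  private long n;      // number of values seen so far
  private double mean; // running mean
  private double m2;   // running sum of squared deviations from the mean

  /** Folds one value into the running statistics. */
  public void add(double x) {
    n++;
    double delta = x - mean;  // deviation from the old mean
    mean += delta / n;
    m2 += delta * (x - mean); // uses both the old and the new mean
  }

  public double mean() {
    return mean;
  }

  /** Sum of squared deviations, i.e. n times the population variance. */
  public double sumOfSquaredDeviations() {
    return m2;
  }

  /** Population variance; returns 0 when no values have been added. */
  public double populationVariance() {
    return n > 0 ? m2 / n : 0.0;
  }

  public static void main(String[] args) {
    WelfordVariance w = new WelfordVariance();
    for (double v : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
      w.add(v);
    }
    System.out.println("mean=" + w.mean() + " variance=" + w.populationVariance());
  }
}
```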


                

[jira] [Updated] (MAHOUT-943) Improve the way to make the split point on DF.

Posted by "Ikumasa Mukai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ikumasa Mukai updated MAHOUT-943:
---------------------------------

    Attachment: MAHOUT-943.patch

I made a patch.

Following Deneche-san's advice, I added a mechanism to change the TreeBuilder configuration through an XML file.

{noformat}
<?xml version="1.0"?>
<configuration>
  <treeBuilder class="org.apache.mahout.classifier.df.builder.DecisionTreeBuilder">
    <igSplit class="org.apache.mahout.classifier.df.split.ClassificationSplit"/>
    <m>5</m>
  </treeBuilder>
</configuration>
{noformat}

The ClassificationSplit class is a sample splitter that uses the average value as the split point.

{noformat}
./hadoop jar $MAHOUT_HOME/mahout-examples-0.6-SNAPSHOT-job.jar \
org.apache.mahout.classifier.df.mapreduce.BuildForest \
-Dmapred.max.split.size=1874231 \
-d $KDD_DATA/KDDTrain.data \
-ds $KDD_DATA/KDDTrain+.info \
-c $MAHOUT_HOME/conf/df-config.xml \
-p -t 100 -o $KDD_DATA/model
{noformat}

I added a "-c" parameter to BuildForest. This parameter should point to the configuration (XML) file.
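
As an illustration of how such a file could be read, here is a minimal sketch using the JDK's DOM parser. It is not the code from the patch; the class DfConfigSketch and its read() method are hypothetical, and only the element and attribute names come from the sample above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DfConfigSketch {

  /**
   * Hypothetical reader: returns the treeBuilder class name, the igSplit
   * class name and the value of m, in that order.
   */
  public static String[] read(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      Element treeBuilder = (Element) doc.getElementsByTagName("treeBuilder").item(0);
      Element igSplit = (Element) treeBuilder.getElementsByTagName("igSplit").item(0);
      String m = treeBuilder.getElementsByTagName("m").item(0).getTextContent().trim();
      return new String[] {
        treeBuilder.getAttribute("class"), igSplit.getAttribute("class"), m
      };
    } catch (Exception e) {
      throw new IllegalArgumentException("cannot read configuration", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<?xml version=\"1.0\"?>"
        + "<configuration>"
        + "<treeBuilder class=\"org.apache.mahout.classifier.df.builder.DecisionTreeBuilder\">"
        + "<igSplit class=\"org.apache.mahout.classifier.df.split.ClassificationSplit\"/>"
        + "<m>5</m>"
        + "</treeBuilder>"
        + "</configuration>";
    for (String value : read(xml)) {
      System.out.println(value);
    }
  }
}
```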
                

[jira] [Commented] (MAHOUT-943) Improve the way to make the split point on DF.

Posted by "Ikumasa Mukai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186179#comment-13186179 ] 

Ikumasa Mukai commented on MAHOUT-943:
--------------------------------------

I posted a patch for Regressionsplit.java on MAHOUT-945,
because this issue (943) is about the classification method.
                

[jira] [Commented] (MAHOUT-943) Improve the way to make the split point on DF.

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184032#comment-13184032 ] 

Sean Owen commented on MAHOUT-943:
----------------------------------

Or use RunningAverageAndStdDev, which does this too.
                