Posted to dev@mahout.apache.org by "Wang Yue (Created) (JIRA)" <ji...@apache.org> on 2012/01/11 13:01:40 UTC

[jira] [Created] (MAHOUT-945) The variance calculation of Random forest regression tree

The variance calculation of Random forest regression tree
---------------------------------------------------------

                 Key: MAHOUT-945
                 URL: https://issues.apache.org/jira/browse/MAHOUT-945
             Project: Mahout
          Issue Type: Improvement
          Components: Classification
    Affects Versions: 0.6
            Reporter: Wang Yue


Hi, Mukai
  Thanks for your efforts in extending the RF to regression. However, I have a doubt about your implementation in Regressionsplit.java. The variance method is:
"
 private static double variance(double[] s, double[] ss, double[] dataSize) {
    double var = 0;
    for (int i = 0; i < s.length; i++) {
      if (dataSize[i] > 0) {
        var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
      }
    }
    return var;
  }
"

In my mind, the variance should instead be something like
var += ss[i]/dataSize[i] - ((s[i] * s[i]) / (dataSize[i]*dataSize[i]));

Please correct me if I am wrong. Thanks
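
To make the difference between the two expressions concrete, here is a minimal Java sketch. The class and method names (VarianceFormulas, sumOfSquaredDeviations, populationVariance) are hypothetical and are not part of Mahout; s is the sum of the values, ss the sum of their squares, and n the number of values, as in the method quoted above.

```java
/** Hypothetical helper, not Mahout code: contrasts the two expressions discussed above. */
final class VarianceFormulas {

  private VarianceFormulas() {}

  /** ss - s^2/n: the sum of squared deviations from the mean (not divided by n). */
  static double sumOfSquaredDeviations(double s, double ss, double n) {
    return ss - (s * s) / n;
  }

  /** ss/n - s^2/n^2: the population variance, i.e. the value above divided by n. */
  static double populationVariance(double s, double ss, double n) {
    return ss / n - (s * s) / (n * n);
  }

  public static void main(String[] args) {
    // Values 1, 2, 3, 4 give s = 10, ss = 30, n = 4.
    System.out.println(sumOfSquaredDeviations(10, 30, 4)); // prints 5.0
    System.out.println(populationVariance(10, 30, 4));     // prints 1.25
  }
}
```

The two results differ exactly by the factor n, which is the point in question.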

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] [Commented] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by IKumasa Mukai <ik...@gmail.com>.
Hi Ted-san.

I made a patch using Welford's method, as you advised, rather than the
weighted incremental algorithm.

I am now checking whether the duplicate code can be merged with
FullRunningAverageAndStdDev.

Thanks,

2012/1/16 Ted Dunning <te...@gmail.com>:
> Why not just use an OnlineAccumulator?  Why duplicate code?
>
> On Sun, Jan 15, 2012 at 11:59 AM, Wang Yue (Commented) (JIRA) <
> jira@apache.org> wrote:
>
>>
>>    [
>> https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186485#comment-13186485]
>>
>> Wang Yue commented on MAHOUT-945:
>> ---------------------------------
>>
>> Hi, Ikumaso Mukai,
>>  Thanks for your improvement, I realize that you actually implement the
>> new online version of variance calculation according to
>> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance,
>> however, the problem I indicate still exists, that is, the final variance
>> should divide by n(which is sample size.) So, I would suggest to modify the
>> third last line of following code, do you think so?
>>
>> +  /**
>> +   * Calculator for variance calculation
>> +   */
>> +  private static class VarianceCalculator {
>> +
>> +    private int n;
>> +    private double mean;
>> +    private double var;
>> +
>> +    void add(double value) {
>> +      n++;
>> +      double oldMean = mean;
>> +      mean += (value - mean) / n;
>> +      double diff = (value - mean) * (value - oldMean);
>> +      var += diff;
>> +    }
>> +
>> +    double getVariance() {
>> +      return var/n;   //// suggested by Wang Yue
>>
>> +    }
>> +  }
>>



-- 
- - - - - - -
IKumasa Mukai at Recruit Co.,Ltd.

Re: [jira] [Commented] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by Ted Dunning <te...@gmail.com>.
Why not just use an OnlineAccumulator?  Why duplicate code?

On Sun, Jan 15, 2012 at 11:59 AM, Wang Yue (Commented) (JIRA) <
jira@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186485#comment-13186485]
>
> Wang Yue commented on MAHOUT-945:
> ---------------------------------
>
> Hi, Ikumaso Mukai,
>  Thanks for your improvement, I realize that you actually implement the
> new online version of variance calculation according to
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance,
> however, the problem I indicate still exists, that is, the final variance
> should divide by n(which is sample size.) So, I would suggest to modify the
> third last line of following code, do you think so?
>
> +  /**
> +   * Calculator for variance calculation
> +   */
> +  private static class VarianceCalculator {
> +
> +    private int n;
> +    private double mean;
> +    private double var;
> +
> +    void add(double value) {
> +      n++;
> +      double oldMean = mean;
> +      mean += (value - mean) / n;
> +      double diff = (value - mean) * (value - oldMean);
> +      var += diff;
> +    }
> +
> +    double getVariance() {
> +      return var/n;   //// suggested by Wang Yue
>
> +    }
> +  }
>

[jira] [Commented] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Ikumasa Mukai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284596#comment-13284596 ] 

Ikumasa Mukai commented on MAHOUT-945:
--------------------------------------

I have checked that the latest patch can be applied to revision 1235053.

                

        

[jira] [Commented] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Wang Yue (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201399#comment-13201399 ] 

Wang Yue commented on MAHOUT-945:
---------------------------------

Hi, Mukai
  Thanks for your efforts. Your modification looks fine to me. I have a
question about building the decision trees.
  Do you know what the "complementary" option means when using Mahout to build
the random forest?

On Mon, Jan 23, 2012 at 6:55 AM, Ikumasa Mukai (Updated) (JIRA) <



-- 
Regards, Wang Yue
PhD Starts From 08 Fall
NUS  Graduate School for Integrative Sciences and Engineering, NUS
Email: wangyue@nus.edu.sg
Homepage: https://sites.google.com/site/fayue1015/
HP:    +65 81022515

                

        

[jira] [Commented] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399411#comment-13399411 ] 

Sean Owen commented on MAHOUT-945:
----------------------------------

I can take this up. I don't understand the need for the change in FullRunningAverageAndStdDev. If you don't want to calculate stdev, use FullRunningAverage.
                

        

[jira] [Updated] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Ikumasa Mukai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ikumasa Mukai updated MAHOUT-945:
---------------------------------

    Attachment: MAHOUT-945.patch

Hi
I made a new patch that incorporates Wang-san's point. Thank you, Wang-san.

In this patch I use FullRunningAverageAndStdDev instead of my own code to calculate the variances.

For performance, the patch also modifies FullRunningAverageAndStdDev.

I would appreciate it if you could check whether the modification is acceptable.

Regards,
                

        

[jira] [Commented] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Wang Yue (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187920#comment-13187920 ] 

Wang Yue commented on MAHOUT-945:
---------------------------------

Hi,
   Partly yes. My point is that when we calculate the variance we should divide by the size "n"; otherwise it will cause errors.
   When I asked why you did not divide by the size n, your answer was that it would cause rounding errors.
   I would say this is not very sound. :(

   If you stick to using the total values, as in your original code, then the calculation for the first split is 25.63-(1.67+37.5) = -13
   How do I get this formula?
   Recall that your original code did not divide by the size n (here 110), so your code obtains a total variance of 25.63, a first-group variance of 1.67, and a second-group variance of 37.5, while the real variances are 25.63/110 = 0.233, 1.67/10 = 0.167, and 37.5/100 = 0.375.
   Hope this clarifies.
   
                

        

[jira] [Updated] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Ikumasa Mukai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ikumasa Mukai updated MAHOUT-945:
---------------------------------

    Attachment: MAHOUT-945.patch

At Ted-san's suggestion,
I made a patch for using Welford's method to calculate the variances.
                

        

[jira] [Commented] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Wang Yue (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186485#comment-13186485 ] 

Wang Yue commented on MAHOUT-945:
---------------------------------

Hi Ikumasa Mukai,
  Thanks for your improvement. I see that you have implemented the online version of the variance calculation described at http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance. However, the problem I pointed out still exists: the final variance should be divided by n (the sample size). So I would suggest modifying the third-to-last line of the following code. What do you think?

+  /**
+   * Calculator for variance calculation
+   */
+  private static class VarianceCalculator {
+
+    private int n;
+    private double mean;
+    private double var;
+    
+    void add(double value) {
+      n++;
+      double oldMean = mean;
+      mean += (value - mean) / n;
+      double diff = (value - mean) * (value - oldMean);
+      var += diff;
+    }
+
+    double getVariance() {
+      return var / n;   // suggested by Wang Yue
+    }
+  } 
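
For reference, the running calculation in the snippet above can be written as a standalone class that exposes both quantities under discussion. This is only a sketch of Welford's online algorithm under assumed names (RunningVariance, getSumOfSquaredDeviations); it is neither the code from the attached patch nor Mahout's FullRunningAverageAndStdDev.

```java
/** Hypothetical sketch of Welford's online algorithm; not the code from the patch. */
final class RunningVariance {

  private long n;
  private double mean;
  private double m2; // running sum of squared deviations from the current mean

  /** Folds one value into the running mean and the running sum of squared deviations. */
  void add(double value) {
    n++;
    double oldMean = mean;
    mean += (value - oldMean) / n;
    m2 += (value - mean) * (value - oldMean);
  }

  /** The accumulated total, i.e. the quantity before dividing by n. */
  double getSumOfSquaredDeviations() {
    return m2;
  }

  /** Population variance: the accumulated total divided by the sample size n. */
  double getVariance() {
    return n > 0 ? m2 / n : Double.NaN;
  }

  public static void main(String[] args) {
    RunningVariance rv = new RunningVariance();
    for (double v : new double[] {1, 2, 3, 4}) {
      rv.add(v);
    }
    System.out.println(rv.getSumOfSquaredDeviations()); // prints 5.0
    System.out.println(rv.getVariance());               // prints 1.25
  }
}
```

Keeping both accessors makes it explicit which quantity a caller is comparing when it evaluates a split.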
                

        

[jira] [Commented] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Ikumasa Mukai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186881#comment-13186881 ] 

Ikumasa Mukai commented on MAHOUT-945:
--------------------------------------

Hi wang-san.
Thank you for your comment, and sorry for not replying to your point.

But for just building trees, I think it is better not to divide by "n",
because dividing by "n" will produce rounding errors when calculating the gains,
and omitting it does not change the logic.

It would be great if you could check this.

Regards,
                

        

[jira] [Commented] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Ikumasa Mukai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187883#comment-13187883 ] 

Ikumasa Mukai commented on MAHOUT-945:
--------------------------------------

Hi wang-san.
Thank you for your detailed explanation.

So the point is that we must use the weighted average, right?

If I stick to using the total values for the calculation in your example,
the 1st split should be calculated as 25.63-(16.6*11+19*1.1) and
the 2nd split as 25.63-(1.67*11+37.5*1.1).

Have I understood your suggestion correctly?
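
One common reading of "weighted average" is to weight each group's variance by its share of the samples. Under that assumption (which is not necessarily the weighting written above), the gain computed from undivided sums of squared deviations equals the size-weighted variance gain multiplied by the total sample count N, so both rank splits of the same node identically. The sketch below checks this on made-up data; SplitGain and its methods are hypothetical names, not Mahout code, and all arrays are assumed to be non-empty.

```java
/** Hypothetical sketch, not Mahout code: compares two ways of scoring one split. */
final class SplitGain {

  private SplitGain() {}

  /** Sum of squared deviations from the mean, computed as ss - s^2/n. */
  static double sumOfSquaredDeviations(double[] values) {
    double s = 0;
    double ss = 0;
    for (double v : values) {
      s += v;
      ss += v * v;
    }
    return ss - (s * s) / values.length;
  }

  /** Gain from undivided totals: parent total minus the sum of the child totals. */
  static double totalsGain(double[] all, double[] left, double[] right) {
    return sumOfSquaredDeviations(all)
        - (sumOfSquaredDeviations(left) + sumOfSquaredDeviations(right));
  }

  /** Gain from variances, with each child variance weighted by its share of the samples. */
  static double weightedVarianceGain(double[] all, double[] left, double[] right) {
    double n = all.length;
    double varAll = sumOfSquaredDeviations(all) / n;
    double varLeft = sumOfSquaredDeviations(left) / left.length;
    double varRight = sumOfSquaredDeviations(right) / right.length;
    return varAll - ((left.length / n) * varLeft + (right.length / n) * varRight);
  }

  public static void main(String[] args) {
    double[] left = {1, 2, 3, 4};
    double[] right = {10, 20};
    double[] all = {1, 2, 3, 4, 10, 20};
    // The first value divided by N (here 6) matches the second up to rounding.
    System.out.println(totalsGain(all, left, right));
    System.out.println(weightedVarianceGain(all, left, right));
  }
}
```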
                

        

[jira] [Commented] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Wang Yue (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187050#comment-13187050 ] 

Wang Yue commented on MAHOUT-945:
---------------------------------

Hi,
I still doubt that it is correct to skip the division by n on the grounds of rounding error.

Please correct me if I am wrong.

Below is a counterexample.

Generate 110 numbers in two groups.
First group with 10 numbers:
t1=rnorm(10,1,1)
> var(t1)
[1] 1.667472

Second group with 100 numbers:
t2=rnorm(100,1,0.5)
> var(t2)
[1] 0.1928758

The overall variance of all 110 numbers:
> t=c(t1,t2)
> var(t)
[1] 0.2339202

The split above is the first way of splitting the 110 numbers.

One group with 10 numbers: variance 1.66; your calculated variance is 16.6
Second group with 100 numbers: variance 0.19; your calculated variance is 19
Overall: variance 0.233; your calculated variance is 25.63

Your calculated variance reduction = 25.63-(16.6+19) = about -10
The real variance reduction is 0.233-(1.66+0.19) = -1.62

Second split 
> tt2=t2[1:10]
> var(tt2)
0.1673325
> tt1=c(t2[11:100],t1)
> var(tt1)
[1] 0.3757684

One group with 10 numbers: variance 0.167; your calculated variance is 1.67
Second group with 100 numbers: variance 0.375; your calculated variance is 37.5
Overall: variance 0.233; your calculated variance is 25.63

Your calculated variance reduction = 25.63-(1.67+37.5) = about -13.5
The real variance reduction is 0.233-(0.167+0.375) = about -0.31

Your program will choose the first split, while the real best split may be the second.
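
The arithmetic of this counterexample can be replayed with a short Java sketch. The inputs are the rounded figures quoted in this message; as in the message, each gain is the parent value minus the plain, unweighted sum of the two group values. SplitComparison and its methods are hypothetical names used only for illustration.

```java
/** Hypothetical sketch: replays the arithmetic of the counterexample above. */
final class SplitComparison {

  private SplitComparison() {}

  /** Gain as the parent value minus the plain (unweighted) sum of the two group values. */
  static double gain(double parent, double group1, double group2) {
    return parent - (group1 + group2);
  }

  /** Returns 1 or 2, whichever split has the larger gain (ties go to split 1). */
  static int betterSplit(double gainOfSplit1, double gainOfSplit2) {
    return gainOfSplit1 >= gainOfSplit2 ? 1 : 2;
  }

  public static void main(String[] args) {
    // Values not divided by n, as quoted in the message.
    double total1 = gain(25.63, 16.6, 19);
    double total2 = gain(25.63, 1.67, 37.5);
    // Values divided by n, as quoted in the message.
    double var1 = gain(0.233, 1.66, 0.19);
    double var2 = gain(0.233, 0.167, 0.375);

    System.out.println("undivided values pick split " + betterSplit(total1, total2));
    System.out.println("divided values pick split " + betterSplit(var1, var2));
  }
}
```

With these figures the undivided values favour the first split and the divided values favour the second, which is the disagreement described above.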


                

        

[jira] [Issue Comment Edited] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Ikumasa Mukai (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186174#comment-13186174 ] 

Ikumasa Mukai edited comment on MAHOUT-945 at 1/14/12 11:59 AM:
----------------------------------------------------------------

At Ted-san's suggestion on MAHOUT-943,
I made a patch for using Welford's method to calculate the variances.
                
      was (Author: ikumasa mukai):
    At Ted-san's suggestion,
I made a patch for using Welford's method to calc the variances.
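
For readers unfamiliar with it, Welford's method updates a running mean and a running sum of squared deviations one value at a time, which avoids subtracting two large, nearly equal sums. The following is only a minimal sketch of that idea. The class and method names are hypothetical; they are not taken from the attached patch or from Mahout's existing classes.

```java
/** Hypothetical accumulator illustrating Welford's method; not the patch's code. */
final class WelfordVariance {
  private long count;   // number of values seen so far
  private double mean;  // running mean
  private double m2;    // running sum of squared deviations from the mean

  /** Folds one value into the running statistics. */
  void add(double x) {
    count++;
    double delta = x - mean;     // deviation from the old mean
    mean += delta / count;       // update the mean
    m2 += delta * (x - mean);    // uses both the old and the new mean
  }

  long count() {
    return count;
  }

  double mean() {
    return mean;
  }

  /** Sum of squared deviations, i.e. n times the population variance. */
  double sumSquaredDeviations() {
    return m2;
  }

  /** Population variance (divides by n); returns 0 when no values were added. */
  double populationVariance() {
    return count > 0 ? m2 / count : 0.0;
  }

  public static void main(String[] args) {
    WelfordVariance acc = new WelfordVariance();
    for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
      acc.add(x);
    }
    // For this data the mean is 5 and the population variance is 4,
    // up to floating-point rounding.
    System.out.println(acc.mean());
    System.out.println(acc.populationVariance());
  }
}
```

Keeping the sum of squared deviations and the variance as separate accessors lets a caller pick whichever quantity its split criterion needs without computing a square root.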
                  


        

[jira] [Commented] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Ikumasa Mukai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405805#comment-13405805 ] 

Ikumasa Mukai commented on MAHOUT-945:
--------------------------------------

Thank you for checking.
OK. I have compared both methods and confirmed that they produce the same result.
I will attach a new patch that uses FullRunningAverage to this issue.
                


        

[jira] [Updated] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Ikumasa Mukai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ikumasa Mukai updated MAHOUT-945:
---------------------------------

    Attachment: MAHOUT-945.patch

I've made a new patch that uses FullRunningAverage, with a new test case.

                


        

[jira] [Updated] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Ikumasa Mukai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ikumasa Mukai updated MAHOUT-945:
---------------------------------

    Status: Patch Available  (was: Open)
    


        

[jira] [Commented] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Ikumasa Mukai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186196#comment-13186196 ] 

Ikumasa Mukai commented on MAHOUT-945:
--------------------------------------

I appreciate your quick review.

Of course, I checked RunningAverageAndStdDev as you advised, and I found that we could use it.

However, I did not use it, in order to avoid the time spent calculating the standard deviation (our regression logic needs only the variance).

What do you think?

Regards,
                


        

[jira] [Commented] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Ikumasa Mukai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202236#comment-13202236 ] 

Ikumasa Mukai commented on MAHOUT-945:
--------------------------------------

Hi Wang-san.
Thank you for checking.

I think my patch will be committed if the modification to FullRunningAverageAndStdDev is accepted.

Regards,
                


        

[jira] [Commented] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186184#comment-13186184 ] 

Sean Owen commented on MAHOUT-945:
----------------------------------

That's good, but the new implementation just duplicates two other implementations of Welford's method in the code base. Can we not just use RunningAverageAndStdDev?
                


        

[jira] [Updated] (MAHOUT-945) The variance calculation of Random forest regression tree

Posted by "Wang Yue (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wang Yue updated MAHOUT-945:
----------------------------

    Description: 
Hi, Mukai
  Thanks for your efforts in extending the RF to regression. However, I have a doubt about your implementation in Regressionsplit.java. The variance method is
"
 private static double variance(double[] s, double[] ss, double[] dataSize) {
    double var = 0;
    for (int i = 0; i < s.length; i++) {
      if (dataSize[i] > 0) {
        var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
      }
    }
    return var;
  }
"

In my mind, the variance should instead be something like
var += ss[i]/dataSize[i] - ((s[i] * s[i]) / (dataSize[i]*dataSize[i]));

Please correct me if I am wrong. Thanks

  was:
Hi, Mukai
  Thanks for your efforts in extending the RF to regression. However, I have a doubt about your implementation in Regressionsplit.java. The variance method is
"
 private static double variance(double[] s, double[] ss, double[] dataSize) {
    double var = 0;
    for (int i = 0; i < s.length; i++) {
      if (dataSize[i] > 0) {
        var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
      }
    }
    return var;
  }
"

In my mind, the variance should instead be something like
var += ss[i]/dataSize[i] - ((s[i] * s[i]) / dataSize[i]*dataSize[i]);

Please correct me if I am wrong. Thanks
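
The only difference between the two versions of the description is the pair of parentheses around dataSize[i]*dataSize[i]. In Java, / and * have equal precedence and associate left to right, so without the parentheses the term is evaluated as (s*s / n) * n. The sketch below compares the three expressions for a single partition. The class and method names are hypothetical, and s, ss and n stand for the sum, the sum of squares and the count, as s[i], ss[i] and dataSize[i] appear to be used in the quoted method.

```java
/** Hypothetical helper comparing the three expressions from this issue. */
final class VarianceFormulas {

  /** Expression in the quoted method: the sum of squared deviations. */
  static double sumSquaredDeviations(double s, double ss, double n) {
    return ss - (s * s) / n;
  }

  /** Expression in the updated description: the population variance. */
  static double populationVariance(double s, double ss, double n) {
    return ss / n - (s * s) / (n * n);
  }

  /** Expression in the earlier description, without the inner parentheses. */
  static double withoutParentheses(double s, double ss, double n) {
    // Parsed as ((s * s) / n) * n, not (s * s) / (n * n).
    return ss / n - ((s * s) / n * n);
  }

  public static void main(String[] args) {
    // Values 1, 2, 3, 4: sum = 10, sum of squares = 30, count = 4.
    double s = 10;
    double ss = 30;
    double n = 4;
    System.out.println(sumSquaredDeviations(s, ss, n)); // prints 5.0
    System.out.println(populationVariance(s, ss, n));   // prints 1.25
    System.out.println(withoutParentheses(s, ss, n));   // prints -92.5
  }
}
```

For this data the quoted method's expression equals n times the population variance (5.0 = 4 * 1.25), so the two differ only by the factor n within each partition.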

    
