Posted to dev@mahout.apache.org by "Oleg Levchenko (JIRA)" <ji...@apache.org> on 2011/05/26 09:37:47 UTC

[jira] [Created] (MAHOUT-713) Random Forest Prototypes

Random Forest Prototypes
------------------------

                 Key: MAHOUT-713
                 URL: https://issues.apache.org/jira/browse/MAHOUT-713
             Project: Mahout
          Issue Type: New Feature
          Components: Classification
            Reporter: Oleg Levchenko
            Priority: Minor


Below is an explanation by Breiman (http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#prototype):

Prototypes are a way of getting a picture of how the variables relate to the classification. 

For the jth class, we find the case that has the largest number of class j cases among its k nearest neighbors, determined using the proximities. Among these k cases we find the median, 25th percentile, and 75th percentile for each variable. 

The medians are the prototype for class j and the quartiles give an estimate of its stability.

For the second prototype, we repeat the procedure but only consider cases that are not among the original k, and so on. 

Prototypes for continuous variables are standardized by subtracting the 5th percentile and dividing by the difference between the 95th and 5th percentiles.

For categorical variables, the prototype is the most frequent value.
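
To make the procedure above concrete, here is a minimal sketch in plain Python. It is not Mahout code and not Breiman's implementation. The names percentile, standardize and class_prototypes are hypothetical, the proximity matrix is assumed to be given (higher means closer), a case is assumed to count among its own neighbors, and the percentile rule (linear interpolation) is one choice among several.

```python
from collections import Counter


def percentile(values, q):
    """Percentile with linear interpolation; q is in [0, 100] (an assumed rule)."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def standardize(value, column):
    """Subtract the 5th percentile and divide by the 95th-5th percentile span."""
    p5, p95 = percentile(column, 5), percentile(column, 95)
    # Assumption: a constant column maps to 0.0 instead of dividing by zero.
    return 0.0 if p95 == p5 else (value - p5) / (p95 - p5)


def class_prototypes(X, y, prox, target, k, n_prototypes=1, categorical=()):
    """Return up to n_prototypes prototypes for class `target`.

    X: list of rows, y: class labels, prox: n x n proximity matrix,
    categorical: indices of categorical variables.
    """
    available = set(range(len(X)))
    prototypes = []
    for _ in range(n_prototypes):
        best_count, best_nbrs = 0, None
        for i in sorted(available):
            # k nearest neighbors of case i among the cases still available
            nbrs = sorted(available, key=lambda j: (-prox[i][j], j))[:k]
            count = sum(1 for j in nbrs if y[j] == target)
            if count > best_count:
                best_count, best_nbrs = count, nbrs
        if best_nbrs is None:
            break  # no remaining neighborhood contains a case of this class
        values, quartiles = [], []
        for f in range(len(X[0])):
            col = [X[j][f] for j in best_nbrs]
            if f in categorical:
                # categorical variable: most frequent value, no quartiles
                values.append(Counter(col).most_common(1)[0][0])
                quartiles.append(None)
            else:
                # continuous variable: median plus 25th and 75th percentiles
                values.append(percentile(col, 50))
                quartiles.append((percentile(col, 25), percentile(col, 75)))
        prototypes.append(
            {"neighbors": best_nbrs, "values": values, "quartiles": quartiles}
        )
        # later prototypes only consider cases outside the neighborhoods used so far
        available -= set(best_nbrs)
    return prototypes
```

The sketch takes the median and quartiles over all k neighbors, as the text says. Restricting them to the class j cases among the neighbors would be a different variant. Standardization is kept as a separate helper so that it can be applied to a continuous prototype value using the full column of training data.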

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] [Resolved] (MAHOUT-713) Random Forest Prototypes

Posted by deneche abdelhakim <ad...@gmail.com>.
Yeah, MAHOUT-835 removed the callbacks from the code, so the suggested
implementation is not relevant any more.

On Sat, Oct 15, 2011 at 11:15 AM, Sean Owen (Resolved) (JIRA) <
jira@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Sean Owen resolved MAHOUT-713.
> ------------------------------
>
>    Resolution: Won't Fix
>
> I don't think anyone has come forward to work on this, and I have no evidence
> that will happen. Marking WontFix for now.

[jira] [Issue Comment Edited] (MAHOUT-713) Random Forest Prototypes

Posted by "Oleg Levchenko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039573#comment-13039573 ] 

Oleg Levchenko edited comment on MAHOUT-713 at 5/26/11 7:54 AM:
----------------------------------------------------------------

OK, effectively I am suggesting that we augment the callbacks package (org.apache.mahout.df.callback) with a couple of additional callbacks: one for collating the inter-case proximity matrix (cases are called "instances" in org.apache.mahout.df.callback.ForestPredictions), and a second one for extracting prototypes based on that proximity matrix.
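
The first of the two callbacks could look roughly like the sketch below. It is written in plain Python for brevity and is not the Mahout callback API: the class name ProximityCollector and the method on_tree are hypothetical. It assumes that, for each tree, we are told which terminal node every case reaches, and it uses Breiman's usual definition of proximity (the fraction of trees in which two cases land in the same terminal node).

```python
class ProximityCollector:
    """Hypothetical callback that accumulates an inter-case proximity matrix."""

    def __init__(self, n_cases):
        self.n_cases = n_cases
        self.n_trees = 0
        self.counts = [[0] * n_cases for _ in range(n_cases)]

    def on_tree(self, leaf_ids):
        """Record one tree; leaf_ids[i] is the terminal node reached by case i."""
        if len(leaf_ids) != self.n_cases:
            raise ValueError("expected one terminal node id per case")
        # group the cases by the terminal node they reach in this tree
        groups = {}
        for case, leaf in enumerate(leaf_ids):
            groups.setdefault(leaf, []).append(case)
        # every pair of cases sharing a terminal node gets one more count
        for members in groups.values():
            for a in members:
                for b in members:
                    self.counts[a][b] += 1
        self.n_trees += 1

    def proximities(self):
        """Return the counts normalized by the number of trees seen so far."""
        if self.n_trees == 0:
            raise ValueError("no trees have been recorded yet")
        return [[c / self.n_trees for c in row] for row in self.counts]
```

The matrix is dense, so memory grows with the square of the number of cases. Whether that is acceptable in a distributed setting is an open question this sketch does not address.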

Should I amend the description of the ticket, or is this comment sufficient?


[jira] [Resolved] (MAHOUT-713) Random Forest Prototypes

Posted by "Sean Owen (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-713.
------------------------------

    Resolution: Won't Fix

I don't think anyone has come forward to work on this, and I have no evidence that will happen. Marking WontFix for now.
                

[jira] [Commented] (MAHOUT-713) Random Forest Prototypes

Posted by "Oleg Levchenko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039573#comment-13039573 ] 

Oleg Levchenko commented on MAHOUT-713:
---------------------------------------

OK, effectively I am suggesting that we augment the callbacks package (org.apache.mahout.df.callback) with a couple of additional callbacks: one for collating the inter-case proximity matrix (cases are called "instances" in org.apache.mahout.df.callback.ForestPredictions), and a second one for extracting prototypes based on that proximity matrix.

Should I amend the description of the ticket, or is this comment sufficient?


[jira] [Commented] (MAHOUT-713) Random Forest Prototypes

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039569#comment-13039569 ] 

Sean Owen commented on MAHOUT-713:
----------------------------------

Right, of course, but this does not state a particular change to Mahout. I understand it to be "implement this", but it would be good to specify more concretely how and where this idea might live.


[jira] [Commented] (MAHOUT-713) Random Forest Prototypes

Posted by "Oleg Levchenko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039567#comment-13039567 ] 

Oleg Levchenko commented on MAHOUT-713:
---------------------------------------

No, it is a feature request.


[jira] [Commented] (MAHOUT-713) Random Forest Prototypes

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039562#comment-13039562 ] 

Sean Owen commented on MAHOUT-713:
----------------------------------

(Is this an issue report?)
