Posted to user@spark.apache.org by OBones <ob...@free.fr> on 2017/06/27 15:07:22 UTC

[ML] Stop conditions for RandomForest

Hello,

Reading up on the theory behind tree-based regression, I concluded 
that there are various reasons to stop exploring the tree once a given 
node has been reached. Among them are these two:

1. When starting to process a node, if its size (row count) is less than 
X, consider it a leaf.
2. When a split for a node is considered, if either side of the split has 
fewer than Y rows, ignore that split when selecting the best split.

As an example, consider a node with 45 rows that, for a given split, 
creates two children containing 5 and 40 rows respectively.

If I set X to 50, the node is a leaf and no split is attempted.
If I set X to 10 and Y to 15, the splits are computed, but because 
one of the children has fewer than 15 rows, that split is ignored.
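
The two rules can be sketched in plain Python. This is only an 
illustration of the logic described above, not Spark code: the function 
names and the (left, right) tuple representation of a split are made up 
for the example, and the child sizes are chosen to sum to the parent's 
45 rows (5 and 40 here).

```python
def is_leaf_by_size(node_rows, x):
    """Rule 1 (X): a node with fewer than X rows is a leaf; no split is tried."""
    return node_rows < x


def is_valid_split(left_rows, right_rows, y):
    """Rule 2 (Y): keep a candidate split only if both children have at least Y rows."""
    return min(left_rows, right_rows) >= y


def decide(node_rows, candidate_splits, x, y):
    """Return a leaf message, or the list of candidate splits that survive rule 2."""
    if is_leaf_by_size(node_rows, x):
        return "leaf: fewer than X rows, no split attempted"
    valid = [s for s in candidate_splits if is_valid_split(s[0], s[1], y)]
    if not valid:
        return "leaf: every candidate split rejected by Y"
    return valid


# The 45-row node, with one candidate split into 5 and 40 rows.
print(decide(45, [(5, 40)], x=50, y=15))            # stopped by X
print(decide(45, [(5, 40)], x=10, y=15))            # split computed, rejected by Y
print(decide(45, [(5, 40), (20, 25)], x=10, y=15))  # only (20, 25) survives
```

With X the decision depends only on the node itself; with Y it depends 
on which candidate splits happen to be available.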

I'm using DecisionTreeRegressor and RandomForestRegressor on our data, 
and because the former is implemented using the latter, they share 
the same parameters.
Going through those parameters, I found minInstancesPerNode, which to me 
is the Y value, but I could not find any parameter for the X value.
Have I missed something?
If not, would there be a way to implement this?

Regards



---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: [ML] Stop conditions for RandomForest

Posted by OBones <ob...@free.fr>.
To me, they are different.
Y controls whether a split is a valid candidate when deciding which 
one to follow.
X makes a node a leaf if it has too few rows to even consider 
candidate splits.
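
To make the difference concrete, here is a rough sketch in plain Python. 
It is not Spark code; the function names and the (left, right) tuples 
standing for candidate splits are hypothetical. It compares a single 
size threshold of max(X, Y) with the two separate rules.

```python
def stops_by_threshold(node_rows, x, y):
    # Single-threshold reading: stop when the node has fewer than max(X, Y) rows.
    return node_rows < max(x, y)


def stops_by_rules(node_rows, candidate_splits, x, y):
    # X is checked on the node itself, before any split is computed.
    if node_rows < x:
        return True
    # Y is checked on each candidate split; the node is a leaf only if none survives.
    return not any(min(left, right) >= y for left, right in candidate_splits)


node_rows, x, y = 45, 10, 15
print(stops_by_threshold(node_rows, x, y))          # False
print(stops_by_rules(node_rows, [(5, 40)], x, y))   # True
print(stops_by_rules(node_rows, [(20, 25)], x, y))  # False
```

The same 45-row node is kept or stopped depending on the candidate 
splits, which a threshold on the node size alone cannot express.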

颜发才(Yan Facai) wrote:
> It seems that splitting will always stop when the row count of a node 
> is less than max(X, Y).
> Hence, are they different?




Re: [ML] Stop conditions for RandomForest

Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.
It seems that splitting will always stop when the row count of a node
is less than max(X, Y).
Hence, are they different?


