Posted to dev@spark.apache.org by Rahul Tanwani <ta...@gmail.com> on 2016/04/11 11:06:59 UTC

Different maxBins value for categorical and continuous features in RandomForest implementation.

Hi,

Currently the RandomForest algorithm takes a single maxBins value to decide
the number of splits to consider. This sometimes makes training time very
high when a single categorical column has a large number of unique values.
Since maxBins has to be at least the number of categories in that column,
this one column raises the bin count for all the numeric (continuous)
columns, even though they do not need such a high number of splits.

Encoding the categorical column into features makes the data very wide,
which requires us to increase maxMemoryInMB and puts more pressure on the
GC as well.

Keeping separate maxBins values for categorical and continuous features
should be useful in this regard.
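
To make the cost concrete, here is a rough sketch in plain Python. It is a
simplified model, not Spark's implementation: the function candidate_splits,
the parameter max_bins_continuous, and the example numbers (one categorical
column with 5000 values, 100 continuous columns) are all hypothetical. It
assumes a feature with b bins contributes b - 1 candidate splits.

```python
# Simplified model of how many candidate splits a tree learner evaluates.
# NOT Spark's code; all names and numbers here are hypothetical.

def candidate_splits(categorical_arities, num_continuous, max_bins,
                     max_bins_continuous=None):
    """Return the total number of candidate splits across all features.

    categorical_arities: distinct-value count of each categorical feature
    num_continuous: number of continuous features
    max_bins: bin limit; must cover the largest categorical arity
    max_bins_continuous: optional separate limit for continuous features
        (the idea proposed in this thread); defaults to max_bins
    """
    largest = max(categorical_arities, default=0)
    if max_bins < largest:
        raise ValueError(
            f"max_bins={max_bins} is smaller than the largest "
            f"categorical arity {largest}")
    if max_bins_continuous is None:
        # Single shared limit: continuous features inherit max_bins.
        max_bins_continuous = max_bins
    # Assumption: a continuous feature with b bins has b - 1 thresholds.
    continuous = num_continuous * (max_bins_continuous - 1)
    # Assumption: an ordered categorical feature with k values has k - 1 splits.
    categorical = sum(k - 1 for k in categorical_arities)
    return continuous + categorical


# One shared limit versus a separate, smaller limit for continuous features.
single = candidate_splits([5000], num_continuous=100, max_bins=5000)
separate = candidate_splits([5000], num_continuous=100, max_bins=5000,
                            max_bins_continuous=32)
print(single, separate)
```

Under these assumptions the shared limit yields 504899 candidate splits and
the separate limit yields 8099, which is the kind of gap the proposal is
meant to close.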




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Different-maxBins-value-for-categorical-and-continuous-features-in-RandomForest-implementation-tp17099.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: Different maxBins value for categorical and continuous features in RandomForest implementation.

Posted by Rahul Tanwani <ta...@gmail.com>.

Added https://issues.apache.org/jira/browse/SPARK-14606




Re: Different maxBins value for categorical and continuous features in RandomForest implementation.

Posted by Joseph Bradley <jo...@databricks.com>.

That sounds useful.  Would you mind creating a JIRA for it?  Thanks!
Joseph
