Posted to issues@spark.apache.org by "Rahul Tanwani (JIRA)" <ji...@apache.org> on 2016/04/13 21:18:25 UTC
[jira] [Created] (SPARK-14606) Different maxBins value for
categorical and continuous features in RandomForest implementation.
Rahul Tanwani created SPARK-14606:
-------------------------------------
Summary: Different maxBins value for categorical and continuous features in RandomForest implementation.
Key: SPARK-14606
URL: https://issues.apache.org/jira/browse/SPARK-14606
Project: Spark
Issue Type: Improvement
Components: ML, MLlib
Affects Versions: 1.6.1, 1.6.0, 1.5.2
Reporter: Rahul Tanwani
Priority: Minor
Fix For: 2.0.0
Currently, the RandomForest algorithm takes a single maxBins value that determines the number of candidate splits for every feature. Training time can become very high when a single categorical column has a large number of unique values, because maxBins must be raised to at least that number. That one column then affects all the numeric (continuous) columns, even though they do not need such a high number of splits.
Encoding the categorical column into separate features makes the data very wide, which requires increasing maxMemoryInMB and also puts more pressure on the GC.
Keeping separate maxBins values for categorical and continuous features would be useful in this regard.
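To give a rough sense of the scale involved, here is a minimal sketch in plain Python. It is not Spark code and not Spark's actual binning logic: the function name total_bins, the parameter max_bins_continuous, and the simplified cost model (each continuous feature uses the full bin budget, each categorical feature uses one bin per category) are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def total_bins(
    num_continuous: int,
    categorical_arities: Sequence[int],
    max_bins: int,
    max_bins_continuous: Optional[int] = None,
) -> int:
    """Count bins per tree node under a simplified, hypothetical cost model.

    Assumptions (not Spark's real implementation):
      * a categorical feature uses one bin per category;
      * a continuous feature uses the whole bin budget it is given;
      * max_bins must be at least the largest categorical arity.
    """
    largest_arity = max(categorical_arities, default=0)
    if max_bins < largest_arity:
        raise ValueError(
            f"max_bins={max_bins} is smaller than the largest "
            f"categorical arity {largest_arity}"
        )
    # With no separate setting, continuous features inherit max_bins.
    if max_bins_continuous is None:
        continuous_bins = max_bins
    else:
        continuous_bins = max_bins_continuous
    return num_continuous * continuous_bins + sum(categorical_arities)


# 50 continuous features plus one categorical feature with 1000 categories.
single = total_bins(50, [1000], max_bins=1000)
separate = total_bins(50, [1000], max_bins=1000, max_bins_continuous=32)
print(single)    # 51000
print(separate)  # 2600
```

Under these assumptions, one high-cardinality column forces about 20 times more bins per node than a separate, smaller setting for continuous features would. The real saving in Spark would depend on how many distinct values each continuous feature actually has.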