Posted to issues@spark.apache.org by "Rahul Tanwani (JIRA)" <ji...@apache.org> on 2016/04/13 21:18:25 UTC

[jira] [Created] (SPARK-14606) Different maxBins value for categorical and continuous features in RandomForest implementation.

Rahul Tanwani created SPARK-14606:
-------------------------------------

             Summary: Different maxBins value for categorical and continuous features in RandomForest implementation.
                 Key: SPARK-14606
                 URL: https://issues.apache.org/jira/browse/SPARK-14606
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 1.6.1, 1.6.0, 1.5.2
            Reporter: Rahul Tanwani
            Priority: Minor
             Fix For: 2.0.0


Currently, the RandomForest implementation takes a single maxBins value that determines the number of candidate splits for every feature. Because maxBins must be at least as large as the number of categories in any categorical feature, a single categorical column with a large number of unique values forces maxBins to be high, and training time can rise sharply. That one column affects all the numeric (continuous) columns, even though they do not need such a high number of splits.

Encoding the categorical column into separate features instead makes the data very wide, which requires us to increase maxMemoryInMB and puts more pressure on the GC as well.

Keeping separate maxBins values for categorical and continuous features would be useful in this regard.
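
To illustrate the cost described above, here is a rough sketch in plain Python. It is a simplified model and not Spark's actual code: the function names, the max_bins_continuous parameter, and the example numbers (50 continuous columns, one categorical column with 1000 values, 1,000,000 rows) are all assumptions made for illustration. It treats the total number of bins across features as a rough proxy for the per-node work.

```python
# Simplified model of bin counts per feature; NOT Spark's implementation.
# All names below are hypothetical and exist only for this illustration.

def bins_single_limit(num_continuous, categorical_arities, num_examples, max_bins):
    """Bins per feature when one maxBins value covers every feature."""
    # Mirrors the requirement that maxBins be at least the largest
    # number of categories in any categorical feature.
    if any(arity > max_bins for arity in categorical_arities):
        raise ValueError("max_bins must be >= the largest categorical arity")
    continuous_bins = min(max_bins, num_examples)
    return [continuous_bins] * num_continuous + list(categorical_arities)


def bins_separate_limits(num_continuous, categorical_arities, num_examples,
                         max_bins_continuous):
    """Bins per feature with a hypothetical separate limit for continuous features."""
    continuous_bins = min(max_bins_continuous, num_examples)
    # Categorical features keep one bin per category in this simplified model.
    return [continuous_bins] * num_continuous + list(categorical_arities)


# One categorical column with 1000 values forces max_bins up to 1000,
# and every continuous column inherits that value.
single = bins_single_limit(50, [1000], 1_000_000, max_bins=1000)

# With a separate limit, the continuous columns can stay at 32 bins.
separate = bins_separate_limits(50, [1000], 1_000_000, max_bins_continuous=32)

print(sum(single))    # 51000
print(sum(separate))  # 2600
```

In this toy setup the continuous columns account for 50,000 of the 51,000 bins under a single limit, which is the overhead a separate setting would avoid.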



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
