Posted to issues@spark.apache.org by "Les Selecky (JIRA)" <ji...@apache.org> on 2015/07/15 21:22:04 UTC

[jira] [Created] (SPARK-9075) DecisionTreeMetadata - setting maxPossibleBins to numExamples is incorrect.

Les Selecky created SPARK-9075:
----------------------------------

             Summary: DecisionTreeMetadata - setting maxPossibleBins to numExamples is incorrect. 
                 Key: SPARK-9075
                 URL: https://issues.apache.org/jira/browse/SPARK-9075
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.4.0
            Reporter: Les Selecky


In https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala there's a statement that sets maxPossibleBins to numExamples when numExamples is less than strategy.maxBins.

This can cause an error when training on small partitions; the error is triggered further down in the logic, where maxCategoriesPerFeature is required to be less than or equal to maxPossibleBins.

Here's an example of how it manifested: the partition contained 49 rows (i.e., numExamples = 49), but strategy.maxBins was 57.

The maxPossibleBins = math.min(strategy.maxBins, numExamples) logic therefore reduced maxPossibleBins to 49. Because a categorical feature had more than 49 categories, the "require(maxCategoriesPerFeature <= maxPossibleBins, ...)" check threw an error.

In short, this will be a problem when training on small datasets with a categorical feature that has more categories than numExamples.

In our local testing we commented out the "math.min(strategy.maxBins, numExamples)" line and the decision tree succeeded where it had failed previously.
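
The failure and the workaround described above can be modeled without Spark. The following is only a minimal sketch, not the actual DecisionTreeMetadata code: the object name MaxBinsSketch, the helpers maxPossibleBins and validate, and the capByExamples flag are all hypothetical, and the category count of 50 is an assumed value, since the report gives only numExamples = 49 and strategy.maxBins = 57.

```scala
import scala.util.Try

// Standalone model of the logic described in this report; it does not call Spark.
// MaxBinsSketch, maxPossibleBins, validate and capByExamples are hypothetical names.
object MaxBinsSketch {
  // capByExamples = true models math.min(strategy.maxBins, numExamples);
  // capByExamples = false models the local workaround of commenting that line out.
  def maxPossibleBins(maxBins: Int, numExamples: Long, capByExamples: Boolean): Int =
    if (capByExamples) math.min(maxBins.toLong, numExamples).toInt else maxBins

  // Same kind of check as the quoted require: it fails when a categorical
  // feature has more categories than the bin budget.
  def validate(maxCategoriesPerFeature: Int, bins: Int): Try[Unit] =
    Try(require(maxCategoriesPerFeature <= bins,
      s"max categories ($maxCategoriesPerFeature) exceeds maxPossibleBins ($bins)"))

  def main(args: Array[String]): Unit = {
    val maxBins = 57        // from the report
    val numExamples = 49L   // from the report
    val maxCategories = 50  // assumed; the report does not state the actual count

    val capped = maxPossibleBins(maxBins, numExamples, capByExamples = true)
    val uncapped = maxPossibleBins(maxBins, numExamples, capByExamples = false)
    println(s"capped=$capped passes=${validate(maxCategories, capped).isSuccess}")
    println(s"uncapped=$uncapped passes=${validate(maxCategories, uncapped).isSuccess}")
  }
}
```

With the cap applied, the bin budget falls to 49 and the check fails for any feature with 50 or more categories; without the cap, the budget stays at 57 and the same feature passes. Whether removing the cap is the right fix for Spark itself is a separate question that this sketch does not answer.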



