Posted to issues@spark.apache.org by "Christian Reiniger (JIRA)" <ji...@apache.org> on 2017/03/24 11:37:42 UTC
[jira] [Created] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
Christian Reiniger created SPARK-20081:
------------------------------------------
Summary: RandomForestClassifier doesn't seem to support more than 100 labels
Key: SPARK-20081
URL: https://issues.apache.org/jira/browse/SPARK-20081
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 2.1.0
Environment: Java
Reporter: Christian Reiniger
When feeding data with more than 100 labels into RandomForestClassifier#fit() (from Java code), I get the following error message:
{code}
Classifier inferred 143 from label values in column rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) allowed to be inferred from values.
To avoid this error for labels with > 100 classes, specify numClasses explicitly in the metadata; this can be done by applying StringIndexer to the label column.
{code}
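The error message recommends applying StringIndexer to the label column. A minimal sketch of that approach (column names "label" and "features" are placeholders for the actual DataFrame schema):

{code:language=scala}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer

// StringIndexer writes NominalAttribute metadata (including the number of
// distinct values) to its output column, which the classifier can read
// instead of inferring the class count from the data.
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

val indexed = indexer.fit(df).transform(df)

val rfc = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")

val model = rfc.fit(indexed)
{code}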
Setting "numClasses" in the metadata for the label column doesn't make a difference. Looking at the code, this is not surprising, since MetadataUtils.getNumClasses() ignores this setting:
{code:language=scala}
def getNumClasses(labelSchema: StructField): Option[Int] = {
  Attribute.fromStructField(labelSchema) match {
    case binAttr: BinaryAttribute => Some(2)
    case nomAttr: NominalAttribute => nomAttr.getNumValues
    case _: NumericAttribute | UnresolvedAttribute => None
  }
}
{code}
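Since the pattern match above does read NominalAttribute.getNumValues, attaching NominalAttribute metadata directly to the label column should (untested sketch; the column name "label" and the count 143 are taken from my case) achieve the same effect as StringIndexer:

{code:language=scala}
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.functions.col

// Declare the label column as a nominal attribute with 143 values;
// getNumClasses() then returns Some(143) instead of inferring a count.
val labelMeta = NominalAttribute.defaultAttr
  .withName("label")
  .withNumValues(143)
  .toMetadata()

val dfWithMeta = df.withColumn("label", col("label").as("label", labelMeta))
{code}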
The alternative would be to pass a proper "maxNumClasses" parameter to the classifier, so that Classifier#getNumClasses() allows a larger number of auto-detected labels. However, RandomForestClassifier#train() calls #getNumClasses() without the "maxNumClasses" parameter, causing it to use the default of 100:
{code:language=scala}
override protected def train(dataset: Dataset[_]): RandomForestClassificationModel = {
  val categoricalFeatures: Map[Int, Int] =
    MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
  val numClasses: Int = getNumClasses(dataset)
  // ...
{code}
My Scala skills are pretty sketchy, so please correct me if I have misinterpreted something. But as it stands, there appears to be no way to learn from data with more than 100 labels via RandomForestClassifier.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)