Posted to issues@spark.apache.org by "AnChe Kuo (JIRA)" <ji...@apache.org> on 2017/09/26 00:01:44 UTC
[jira] [Closed] (SPARK-22034) CrossValidator's training and testing
set with different set of labels, resulting in encoder transform error
[ https://issues.apache.org/jira/browse/SPARK-22034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
AnChe Kuo closed SPARK-22034.
-----------------------------
Resolution: Later
> CrossValidator's training and testing set with different set of labels, resulting in encoder transform error
> ------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-22034
> URL: https://issues.apache.org/jira/browse/SPARK-22034
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.2.0
> Environment: Ubuntu 16.04
> Scala 2.11
> Spark 2.2.0
> Reporter: AnChe Kuo
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> Let's say we have a VectorIndexer with maxCategories set to 13, and the training set has a column containing a month label.
> In CrossValidator, the dataframe is split into training and testing sets automatically. It could happen that a training set lacks month 2 (by chance, or quite frequently if the labels are unbalanced).
> When a training set is fitted within the cross validator, the pipeline is fitted on that training set only, resulting in a partial key map in VectorIndexer. When this pipeline is then used to transform the testing set, VectorIndexer will throw a "key not found" error.
> Making CrossValidator an estimator itself, so that it can be connected to a whole pipeline, is a cool idea, but a bug like this occurs and is not expected.
> The solution, I am guessing, would be to check each stage in the pipeline and, when we see an encoder-type stage, fit that stage's model on the complete dataset.
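
The failure mode and the proposed workaround described above can be illustrated with a minimal sketch in plain Python. It does not call Spark: `fit_indexer`, `transform`, and `k_folds` are hypothetical stand-ins for the VectorIndexer category map and the CrossValidator split, and the toy data is made up so that month 2 is rare.

```python
def fit_indexer(rows, col):
    """Build a category -> index map from the values seen in `rows`."""
    categories = sorted({row[col] for row in rows})
    return {value: index for index, value in enumerate(categories)}


def transform(rows, col, category_map):
    """Replace each value by its index; an unseen value raises KeyError."""
    return [category_map[row[col]] for row in rows]


def k_folds(rows, k):
    """Yield (train, test) splits, assigning row i to fold i % k."""
    for fold in range(k):
        train = [r for i, r in enumerate(rows) if i % k != fold]
        test = [r for i, r in enumerate(rows) if i % k == fold]
        yield train, test


# Toy data (an assumption for illustration): month 2 appears only once,
# so the fold that holds it out trains without ever seeing it.
data = [{"month": m} for m in (1, 1, 2, 3, 3, 1)]

# 1) Fit the indexer inside each fold, as described in the report.
failed_folds = []
for fold, (train, test) in enumerate(k_folds(data, 3)):
    fold_map = fit_indexer(train, "month")
    try:
        transform(test, "month", fold_map)
    except KeyError as missing:
        # The held-out split contains a category absent from the fold's map.
        failed_folds.append((fold, missing.args[0]))

# 2) Suggested workaround: fit the indexer once on the complete dataset
#    and reuse that map in every fold.
full_map = fit_indexer(data, "month")
encoded = [transform(test, "month", full_map) for _, test in k_folds(data, 3)]

print("folds that failed with a per-fold map:", failed_folds)
print("encodings with a full-data map:", encoded)
```

One trade-off of this approach: fitting the category map on the complete dataset lets each fold see which categories exist in its held-out rows, though not their target values. Whether that is acceptable depends on the evaluation being run.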
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org