Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2017/06/07 08:17:18 UTC

[jira] [Commented] (SPARK-21005) VectorIndexerModel does not prepare output column field correctly

    [ https://issues.apache.org/jira/browse/SPARK-21005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040478#comment-16040478 ] 

Apache Spark commented on SPARK-21005:
--------------------------------------

User 'hibayesian' has created a pull request for this issue:
https://github.com/apache/spark/pull/18227

> VectorIndexerModel does not prepare output column field correctly
> -----------------------------------------------------------------
>
>                 Key: SPARK-21005
>                 URL: https://issues.apache.org/jira/browse/SPARK-21005
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Chen Lin
>
> From my reading of the documentation, VectorIndexer decides which features should be categorical based on the number of distinct values: features with at most maxCategories distinct values are declared categorical, while features that exceed maxCategories are declared continuous.
> Currently, VectorIndexerModel works correctly on a dataset whose schema carries no attribute metadata. However, when VectorIndexerModel transforms a dataset with `ML_ATTR` metadata, it may not prepare the output column field as expected. For example, a feature with a nominal attribute whose number of distinct values exceeds maxCategories is not treated as a continuous feature, as expected, but remains a categorical feature (see the sketch below). This can cause all tree-based algorithms (Decision Tree, Random Forest, GBDT, etc.) to fail with errors such as "DecisionTree requires maxBins (= $maxPossibleBins) to be at least as large as the number of values in each categorical feature, but categorical feature $maxCategory has $maxCategoriesPerFeature values. Considering remove this and other categorical features with a large number of values, or add more training examples.".
> Correct me if my understanding is wrong.
> I will submit a PR soon to solve this issue.
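
For illustration, here is a minimal sketch (Scala, against the Spark 2.1 ML API) of the scenario described above. It assumes a SparkSession named `spark` is available; the column name "features", the attribute name "f0", the maxCategories value, and the feature values are invented for the example, and the snippet is not taken from the JIRA issue or the pull request.

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NominalAttribute}
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors

// One-feature dataset whose vector column carries ML_ATTR metadata
// declaring the single feature as nominal with 4 distinct values.
val data = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.0)),
  Tuple1(Vectors.dense(1.0)),
  Tuple1(Vectors.dense(2.0)),
  Tuple1(Vectors.dense(3.0))
)).toDF("features")

val nominal = NominalAttribute.defaultAttr.withName("f0").withNumValues(4)
val group = new AttributeGroup("features", Array[Attribute](nominal))
val withMeta = data.select(data("features").as("features", group.toMetadata()))

// With maxCategories = 2, a feature with 4 distinct values should be treated
// as continuous. Per the report above, the pre-existing nominal metadata is
// carried over, so the output column's field still marks it as categorical.
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexed")
  .setMaxCategories(2)

val model = indexer.fit(withMeta)
val indexed = model.transform(withMeta)

// Inspect the ML_ATTR metadata prepared for the output column.
println(indexed.schema("indexed").metadata)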



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org