Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2018/04/09 19:12:00 UTC

[jira] [Commented] (SPARK-21005) VectorIndexerModel does not prepare output column field correctly

    [ https://issues.apache.org/jira/browse/SPARK-21005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431079#comment-16431079 ] 

Joseph K. Bradley commented on SPARK-21005:
-------------------------------------------

I don't actually see why this is a problem: if a feature is categorical, we should not silently convert it to continuous.  To use a high-arity categorical feature in a decision tree, one should first convert it to a different representation, such as hashing it into a fixed set of bins with HashingTF.
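
For illustration, a minimal Scala sketch of that approach (the DataFrame df and the column names are hypothetical; assumes a SparkSession is in scope):

    import org.apache.spark.ml.feature.HashingTF
    import org.apache.spark.sql.functions.array

    // Hypothetical DataFrame `df` with a high-arity categorical string
    // column "category". Wrap each value in a one-element array so that
    // HashingTF can treat it as a single "term", then hash the term into
    // a fixed number of bins.
    val withTerms = df.withColumn("categoryTerms", array(df("category")))

    val hasher = new HashingTF()
      .setInputCol("categoryTerms")
      .setOutputCol("categoryHashed")
      .setNumFeatures(32)  // bin count; a tradeoff against hash collisions

    val hashed = hasher.transform(withTerms)

The hashed vector can then be fed to a tree-based learner without tripping the per-feature category limit.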

That said, I do think we should clarify this behavior in the VectorIndexer docstring.  I know it's been a long time since you sent your PR, but would you be willing to rework it so that it simply updates the docs?  If you're busy, though, I'd be happy to take it over.  Thanks!

> VectorIndexerModel does not prepare output column field correctly
> -----------------------------------------------------------------
>
>                 Key: SPARK-21005
>                 URL: https://issues.apache.org/jira/browse/SPARK-21005
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Chen Lin
>            Priority: Major
>
> From my understanding of the documentation, VectorIndexer decides which features should be categorical based on the number of distinct values: features with at most maxCategories distinct values are declared categorical, while features that exceed maxCategories are declared continuous.
> Currently, VectorIndexerModel works correctly on a dataset whose schema carries no attribute metadata. However, when VectorIndexerModel transforms a dataset with `ML_ATTR` metadata, it may not produce the expected output: a feature with a nominal attribute whose number of distinct values exceeds maxCategories is not treated as a continuous feature, as expected, but remains categorical (see the reproduction sketch after this quoted description).
> As a result, all tree-based algorithms (Decision Tree, Random Forest, GBDT, etc.) can fail with errors such as "DecisionTree requires maxBins (= $maxPossibleBins) to be at least as large as the number of values in each categorical feature, but categorical feature $maxCategory has $maxCategoriesPerFeature values. Considering remove this and other categorical features with a large number of values, or add more training examples.".
> Correct me if my understanding is wrong.
> I will submit a PR soon to resolve this issue.
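
For context, the scenario described in the quoted report can be sketched roughly as follows (hypothetical data and column names; assumes Spark 2.x-era APIs and a SparkSession named spark):

    import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NominalAttribute}
    import org.apache.spark.ml.feature.VectorIndexer
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.functions.col

    // One vector slot whose nominal (categorical) metadata declares 40
    // distinct values: more than maxCategories below, and more than
    // DecisionTree's default maxBins of 32.
    val attr: Attribute = NominalAttribute.defaultAttr
      .withName("f0")
      .withValues((0 until 40).map(_.toString).toArray)
    val meta = new AttributeGroup("features", Array(attr)).toMetadata()

    import spark.implicits._
    val df = (0 until 40).map(i => Tuple1(Vectors.dense(i.toDouble)))
      .toDF("features")
      .select(col("features").as("features", meta))

    val model = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexed")
      .setMaxCategories(10)  // f0 exceeds this, so it should become continuous
      .fit(df)

    // Per the report above, the "indexed" column still carries the nominal
    // attribute with 40 values, so fitting a DecisionTree on it fails with
    // the maxBins error quoted earlier instead of treating f0 as continuous.
    println(model.transform(df).schema("indexed").metadata)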



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org