Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/01/10 03:27:00 UTC

[jira] [Resolved] (SPARK-26579) SparkML DecisionTree, how does the algorithm identify categorical features?

     [ https://issues.apache.org/jira/browse/SPARK-26579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-26579.
----------------------------------
    Resolution: Invalid

> SparkML DecisionTree, how does the algorithm identify categorical features?
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26579
>                 URL: https://issues.apache.org/jira/browse/SPARK-26579
>             Project: Spark
>          Issue Type: Question
>          Components: ML
>    Affects Versions: 2.4.0
>         Environment: os: Centos7
> software: pyspark.
>            Reporter: Xufeng Wang
>            Priority: Major
>
> I am confused about decision trees and other tree-based models. My current project involves data with both nominal and continuous features. I converted the nominal data to numeric indices using the StringIndexer transformer from the ml.feature module, then used VectorAssembler to combine all feature values into a vector column named features. As far as I can see, every value in the feature vector is of double datatype.
> I kept getting the error saying that maxBins should be larger than the largest number of categories among all categorical features. After I raised maxBins accordingly, I still see some features (continuous from the start) with values larger than my maxBins. Since the pipeline runs with a maxBins that is smaller than some continuous values, I conclude that the algorithm automatically determines which features are categorical and which are continuous. But how does it tell them apart when all of the features are of double datatype?
> Another question, if anyone can help: what type of tree does the Spark decision tree implement? Is it CART or something else?
> Last question: what is the procedure for treating categorical features in tree-based algorithms?
> Thank you in advance.
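
The maxBins behavior described in the question can be illustrated with a small sketch. It assumes that categorical features are identified by metadata attached to the feature column (StringIndexer output columns carry such metadata), not by the double values themselves, and that maxBins only has to cover the category counts. This is plain Python, not Spark's implementation; `categorical_arity`, `check_max_bins`, and `split_feature_kinds` are hypothetical names, loosely modeled on the `categoricalFeaturesInfo` argument of the RDD-based API.

```python
# Illustrative sketch only: this is NOT Spark's implementation.
# `categorical_arity` is a hypothetical stand-in for per-feature metadata:
# feature index -> number of categories. Any feature absent from the map
# is treated as continuous.


def check_max_bins(max_bins, categorical_arity):
    """Raise ValueError if max_bins is smaller than the largest category count."""
    if not categorical_arity:
        return  # no categorical features, nothing to check
    index, arity = max(categorical_arity.items(), key=lambda kv: kv[1])
    if max_bins < arity:
        raise ValueError(
            "maxBins (= %d) must be at least the number of categories; "
            "categorical feature %d has %d" % (max_bins, index, arity)
        )


def split_feature_kinds(num_features, categorical_arity):
    """Return (categorical indices, continuous indices) based on metadata only."""
    categorical = sorted(categorical_arity)
    continuous = [i for i in range(num_features) if i not in categorical_arity]
    return categorical, continuous


# Feature 0 was string-indexed into 40 categories and feature 2 into 5.
# Feature 1 is continuous, so its magnitude is irrelevant to max_bins.
arity = {0: 40, 2: 5}
row = [12.0, 98765.4, 3.0]

categorical, continuous = split_feature_kinds(len(row), arity)
check_max_bins(64, arity)  # passes, because 64 >= 40
```

Under these assumptions, a continuous value far larger than maxBins does not trigger the error, while a maxBins of 32 would fail here because feature 0 has 40 categories.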


