Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/01/28 10:20:00 UTC
[jira] [Commented] (SPARK-26738) Pyspark random forest classifier feature importance with column names

    [ https://issues.apache.org/jira/browse/SPARK-26738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753867#comment-16753867 ] 

Hyukjin Kwon commented on SPARK-26738:
--------------------------------------

Questions should go to the mailing list. Please ask there before filing an issue here; you are likely to get a better answer there.

> Pyspark random forest classifier feature importance with column names
> ---------------------------------------------------------------------
>
>                 Key: SPARK-26738
>                 URL: https://issues.apache.org/jira/browse/SPARK-26738
>             Project: Spark
>          Issue Type: Question
>          Components: ML
>    Affects Versions: 2.3.2
>            Reporter: Praveen
>            Priority: Major
>              Labels: RandomForest, pyspark
>
> I am trying to plot the feature importances of a random forest classifier with column names. I am using Spark 2.3.2 and PySpark.
> The input X consists of sentences, and I am using TF-IDF (HashingTF + IDF) + StringIndexer to generate the feature vectors.
> I have included all the stages in a Pipeline:
>  
> {code:java}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import RandomForestClassifier
> from pyspark.ml.feature import HashingTF, IDF, IndexToString, RegexTokenizer, StringIndexer
>
> # Feature stages: tokenize, hash terms to indices, weight by IDF, index the labels.
> regexTokenizer = RegexTokenizer(gaps=False, inputCol=raw_data_col, outputCol="words", pattern="[a-zA-Z_]+", toLowercase=True, minTokenLength=minimum_token_size)
> hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_feature)
> idf = IDF(inputCol="rawFeatures", outputCol=feature_vec_col)
> indexer = StringIndexer(inputCol=label_col_name, outputCol=label_vec_name)
> # Maps predicted label indices back to the original label strings.
> converter = IndexToString(inputCol="prediction", outputCol="original_label", labels=indexer.fit(df).labels)
> feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer])
> estimator = RandomForestClassifier(labelCol=label_col, featuresCol=features_col, numTrees=100)
> pipeline = Pipeline(stages=[feature_pipeline, estimator, converter])
> model = pipeline.fit(df)
> {code}
> I generate the feature importances as follows:
> {code:java}
> rdc = model.stages[-2]  # the fitted random forest model (second-to-last stage)
> print(rdc.featureImportances)
> {code}
> So far so good, but when I try to map the feature importances to the feature columns as below,
> {code:java}
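> from itertools import chain
> # Collect (index, name) pairs from the ML attribute metadata of the features column.
> attrs = sorted((attr["idx"], attr["name"]) for attr in chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values()))
> # Keep only the features with a non-zero importance.
> [(name, rdc.featureImportances[idx])
>    for idx, name in attrs
>    if rdc.featureImportances[idx]]{code}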
>  
> I get a KeyError on ml_attr:
> {code:java}
> KeyError: 'ml_attr'{code}
> I then printed the metadata dictionary,
> {code:java}
> print(df_pred.schema["featurescol"].metadata){code}
> and it is empty: {}
> Any thoughts on what I am doing wrong? How can I map the feature importances to the column names?
> Thanks
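
A note on the lookup quoted above, offered as a sketch rather than as the reporter's or Spark's own method: the KeyError comes from indexing metadata["ml_attr"] directly when the column carries no ML attributes. One likely reason for the empty dictionary is that HashingTF assigns indices by hashing terms, so no index-to-name mapping exists to attach. The helper below is plain Python and uses no Spark API; the name importances_by_name, the "feature_N" placeholder, and the sample metadata are all hypothetical. It assumes the metadata, when present, has the ml_attr / attrs / idx / name shape that the quoted code expects.

```python
from itertools import chain


def importances_by_name(metadata, importances):
    """Pair each feature importance with a column name where one is known.

    metadata    -- dict shaped like df.schema[col].metadata (may be empty)
    importances -- sequence of floats indexed by feature position
    """
    # .get() avoids the KeyError when the column carries no ML attributes.
    attr_groups = metadata.get("ml_attr", {}).get("attrs", {})
    names = {
        attr["idx"]: attr["name"]
        for attr in chain(*attr_groups.values())
        if "name" in attr
    }
    # Fall back to a positional placeholder for features without a name.
    pairs = [
        (names.get(idx, "feature_%d" % idx), score)
        for idx, score in enumerate(importances)
    ]
    # Most important features first.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical metadata for a column whose features do have names.
named = {
    "ml_attr": {
        "attrs": {"numeric": [{"idx": 0, "name": "age"},
                              {"idx": 1, "name": "income"}]},
        "num_attrs": 2,
    }
}
print(importances_by_name(named, [0.25, 0.75]))
print(importances_by_name({}, [0.25, 0.75]))
```

With the pipeline from the issue, a call would presumably look like importances_by_name(df_pred.schema["featurescol"].metadata, rdc.featureImportances.toArray()). For hashed features this only yields positional placeholders, since the hashed indices cannot be mapped back to terms from the metadata alone.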


