Posted to issues@spark.apache.org by "Waleed Esmail (JIRA)" <ji...@apache.org> on 2018/02/13 20:58:00 UTC

[jira] [Commented] (SPARK-23414) Plotting using matplotlib in MLlib pyspark

    [ https://issues.apache.org/jira/browse/SPARK-23414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363025#comment-16363025 ] 

Waleed Esmail commented on SPARK-23414:
---------------------------------------

I am sorry, I didn't get it. What do you mean by "orthogonal"?

> Plotting using matplotlib in MLlib pyspark 
> -------------------------------------------
>
>                 Key: SPARK-23414
>                 URL: https://issues.apache.org/jira/browse/SPARK-23414
>             Project: Spark
>          Issue Type: Question
>          Components: MLlib
>    Affects Versions: 2.2.1
>            Reporter: Waleed Esmail
>            Priority: Major
>
> Dear MLlib experts,
> I just want to plot a fancy confusion matrix (true values vs. predicted values) like the one produced by the seaborn module in Python, so I did the following:
> {code:java}
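> import matplotlib.pyplot as plt
> import numpy as np
> import seaborn as sns
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.feature import StringIndexer, VectorIndexer
>
> # Index the string labels. `output` is assumed to be a DataFrame defined earlier
> # with "label" and "features" columns.
> labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(output)
> # Automatically identify categorical features, and index them.
> # maxCategories is not set, so the default (20) applies: features with > 20 distinct values are treated as continuous.
> featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures").fit(output)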
> # Split the data into training and test sets (30% held out for testing)
> (trainingData, testData) = output.randomSplit([0.7, 0.3])
> dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=15)
> # Chain indexers and tree in a Pipeline
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
> # Train model.  This also runs the indexers.
> model = pipeline.fit(trainingData)
> # Make predictions.
> predictions = model.transform(testData)
> # Collect predictions and labels in one pass so both arrays come from the same rows,
> # then flatten them to 1-D arrays (collect() returns a list of Row objects).
> predictionAndLabels = predictions.select("prediction", "indexedLabel").collect()
> y_predicted = np.array([row["prediction"] for row in predictionAndLabels])
> y_test = np.array([row["indexedLabel"] for row in predictionAndLabels])
> from sklearn.metrics import confusion_matrix
> figcm, ax = plt.subplots()
> cm = confusion_matrix(y_test, y_predicted)
> # Normalize each row (true class) so that its entries sum to 1.
> cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
> sns.heatmap(cm, square=True, annot=True, cbar=False)
> plt.xlabel('predicted value')
> plt.ylabel('true value')
> {code}
> Is this the right way to do it? Please note that I am new to Spark and MLlib.
>  
> Thank you in advance,
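
For reference, the counting and row normalization in the quoted snippet can be sketched without Spark, scikit-learn, or NumPy. This is only a minimal illustration, not the reporter's code: the helper names `confusion_matrix_from_pairs` and `normalize_rows` and the sample pairs are hypothetical, and the pairs stand in for the collected (indexedLabel, prediction) rows.

```python
from collections import Counter


def confusion_matrix_from_pairs(pairs, num_classes):
    """Count (true_label, predicted_label) pairs into a square table.

    Rows are true labels, columns are predicted labels.
    """
    counts = Counter((int(t), int(p)) for t, p in pairs)
    return [[counts[(t, p)] for p in range(num_classes)]
            for t in range(num_classes)]


def normalize_rows(matrix):
    """Divide each row by its sum; rows with no samples stay all zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Hypothetical (indexedLabel, prediction) pairs standing in for collected rows.
pairs = [(0.0, 0.0), (0.0, 0.0), (0.0, 1.0),
         (1.0, 1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 1.0)]

cm = confusion_matrix_from_pairs(pairs, num_classes=2)
cm_normalized = normalize_rows(cm)
print(cm)  # [[2, 1], [1, 3]]
print(cm_normalized)
```

Each row of `cm_normalized` gives, for one true class, the fraction of its samples assigned to each predicted class. The zero check in `normalize_rows` avoids a division by zero when a class has no samples in the test split.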


