Posted to issues@spark.apache.org by "Harutaka Kawamura (Jira)" <ji...@apache.org> on 2021/04/20 01:30:00 UTC
[jira] [Updated] (SPARK-35142) `OneVsRest` classifier uses incorrect data type for `rawPrediction` column
[ https://issues.apache.org/jira/browse/SPARK-35142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Harutaka Kawamura updated SPARK-35142:
--------------------------------------
Description:
Code to reproduce the issue:
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
from sklearn.datasets import load_iris

spark = SparkSession.builder.getOrCreate()
X, y = load_iris(return_X_y=True)
df = spark.createDataFrame(
    [(Vectors.dense(features), int(label)) for features, label in zip(X, y)], ["features", "label"]
)
train, test = df.randomSplit([0.8, 0.2])

lor = LogisticRegression(maxIter=5)
ovr = OneVsRest(classifier=lor)

ovrModel = ovr.fit(train)
pred = ovrModel.transform(test)

pred.show()
# the line above fails with:
# net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict
# (for pyspark.ml.linalg.DenseVector)

# ====================
# INVESTIGATION
# ====================

pred.printSchema()
# root
# |-- features: vector (nullable = true)
# |-- label: long (nullable = true)
# |-- rawPrediction: string (nullable = true) <- why not numeric?
# |-- prediction: double (nullable = true)
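
The schema check above can also be done programmatically. The following is a minimal sketch, not part of the original report: `find_mistyped_columns` and `EXPECTED_TYPES` are hypothetical names, and the expected types are an assumption (that `rawPrediction` should be a vector column, as it is for other PySpark classifiers). The helper works on plain `(name, type)` pairs of the kind `DataFrame.dtypes` returns, so it runs without a Spark session.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed expectations for the output columns (hypothetical, for illustration):
# type names follow the short form reported by DataFrame.dtypes.
EXPECTED_TYPES: Dict[str, str] = {
    "rawPrediction": "vector",
    "prediction": "double",
}


def find_mistyped_columns(
    dtypes: Iterable[Tuple[str, str]],
    expected: Dict[str, str] = EXPECTED_TYPES,
) -> List[Tuple[str, str, str]]:
    """Return (column, expected_type, actual_type) for every mismatch.

    Columns listed in `expected` but absent from `dtypes` are skipped.
    """
    actual = dict(dtypes)
    return [
        (name, want, actual[name])
        for name, want in expected.items()
        if name in actual and actual[name] != want
    ]


# Pairs mirroring the schema printed in the report; in a live session
# these would come from `pred.dtypes`.
reported = [
    ("features", "vector"),
    ("label", "bigint"),
    ("rawPrediction", "string"),
    ("prediction", "double"),
]

print(find_mistyped_columns(reported))
```

In a live session, calling `find_mistyped_columns(pred.dtypes)` right after `transform` would surface the mismatch before `show()` triggers the pickling error.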
> `OneVsRest` classifier uses incorrect data type for `rawPrediction` column
> --------------------------------------------------------------------------
>
> Key: SPARK-35142
> URL: https://issues.apache.org/jira/browse/SPARK-35142
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 3.0.0, 3.0.2, 3.1.0, 3.1.1
> Reporter: Harutaka Kawamura
> Priority: Major