Posted to issues@spark.apache.org by "Harutaka Kawamura (Jira)" <ji...@apache.org> on 2021/04/20 01:30:00 UTC

[jira] [Updated] (SPARK-35142) `OneVsRest` classifier uses incorrect data type for `rawPrediction` column

     [ https://issues.apache.org/jira/browse/SPARK-35142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harutaka Kawamura updated SPARK-35142:
--------------------------------------
    Description: 
Code to reproduce the issue:

from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
from sklearn.datasets import load_iris

spark = SparkSession.builder.getOrCreate()
X, y = load_iris(return_X_y=True)
df = spark.createDataFrame(
    [(Vectors.dense(features), int(label)) for features, label in zip(X, y)], ["features", "label"]
)
train, test = df.randomSplit([0.8, 0.2])

lor = LogisticRegression(maxIter=5)
ovr = OneVsRest(classifier=lor)

ovrModel = ovr.fit(train)
pred = ovrModel.transform(test)

pred.show()
# the line above fails with:
# net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict
# (for pyspark.ml.linalg.DenseVector)

# ====================
# INVESTIGATION
# ====================

pred.printSchema()
# root
#  |-- features: vector (nullable = true)
#  |-- label: long (nullable = true)
#  |-- rawPrediction: string (nullable = true)  <- why not numeric?
#  |-- prediction: double (nullable = true)
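
The schema check in the investigation can be automated. The following is a minimal sketch, not part of the original report: it parses the printSchema() text quoted above using only the standard library and flags columns whose declared type differs from what is expected. The helper names (parse_schema, find_mismatches) and the EXPECTED mapping are hypothetical, and the assumption that rawPrediction should be declared as a vector is mine, not something the report states.

```python
import re

# Schema text copied from the printSchema() output quoted in the report.
SCHEMA_TEXT = """\
root
 |-- features: vector (nullable = true)
 |-- label: long (nullable = true)
 |-- rawPrediction: string (nullable = true)
 |-- prediction: double (nullable = true)
"""

# Matches one top-level field line such as " |-- label: long (nullable = true)".
FIELD_RE = re.compile(
    r"^\s*\|-- (?P<name>[^:]+): (?P<dtype>\S+) \(nullable = (?:true|false)\)"
)


def parse_schema(text):
    """Return {column name: type name} for the top-level fields."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group("name")] = match.group("dtype")
    return fields


def find_mismatches(fields, expected):
    """Return {column: (expected type, actual type)} where the two differ."""
    return {
        name: (want, fields.get(name))
        for name, want in expected.items()
        if fields.get(name) != want
    }


# Assumption (not from the report): rawPrediction should be a vector column.
EXPECTED = {"rawPrediction": "vector", "prediction": "double"}

fields = parse_schema(SCHEMA_TEXT)
mismatches = find_mismatches(fields, EXPECTED)
print(mismatches)
```

With the schema from this report, only rawPrediction is flagged, which matches the column questioned in the investigation.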

> `OneVsRest` classifier uses incorrect data type for `rawPrediction` column
> --------------------------------------------------------------------------
>
>                 Key: SPARK-35142
>                 URL: https://issues.apache.org/jira/browse/SPARK-35142
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 3.0.0, 3.0.2, 3.1.0, 3.1.1
>            Reporter: Harutaka Kawamura
>            Priority: Major


