Posted to issues@spark.apache.org by "Lukas Thaler (Jira)" <ji...@apache.org> on 2020/03/30 07:51:00 UTC

[jira] [Created] (SPARK-31299) Pyspark.ml.clustering illegalArgumentException with dataframe created from rows

Lukas Thaler created SPARK-31299:
------------------------------------

             Summary: Pyspark.ml.clustering illegalArgumentException with dataframe created from rows
                 Key: SPARK-31299
                 URL: https://issues.apache.org/jira/browse/SPARK-31299
             Project: Spark
          Issue Type: Bug
          Components: ML, PySpark
    Affects Versions: 2.4.0
            Reporter: Lukas Thaler


I hope this is the right place and way to report a bug in (at least) the PySpark API:

BisectingKMeans in the following example is only exemplary; the error occurs with all clustering algorithms:
{code:python}
from pyspark.sql import Row
from pyspark.mllib.linalg import DenseVector
from pyspark.ml.clustering import BisectingKMeans

data = spark.createDataFrame([
    Row(test_features=DenseVector([43.0, 0.0, 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])),
    Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
    Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
    Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
    Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0])),
])

kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
model = kmeans.fit(data)
{code}
The fit() call in the last line fails with the following error:
{code:java}
Py4JJavaError: An error occurred while calling o51.fit.
: java.lang.IllegalArgumentException: requirement failed: Column test_features must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
{code}
Note that the type the error message reports the column as actually having is identical, character for character, to the first entry in the list of allowed types, yet the call fails anyway.
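One plausible explanation (worth confirming): the example builds its vectors with {{pyspark.mllib.linalg}}, while the {{pyspark.ml}} estimators expect vectors from {{pyspark.ml.linalg}}. The two vector UDTs serialize to the same {{struct<type:tinyint,size:int,indices:array<int>,values:array<double>>}} schema, which would explain why the "actual" type printed in the message is indistinguishable from the first allowed type. As a minimal sketch (assuming an active SparkSession bound to {{spark}}, as in the reproduction above), the same code succeeds once the import is switched:
{code:python}
# Illustrative fix, not confirmed as the intended API contract: build the
# vectors with pyspark.ml.linalg (the package the pyspark.ml estimators use)
# instead of pyspark.mllib.linalg.
from pyspark.sql import Row
from pyspark.ml.linalg import DenseVector  # ml, not mllib
from pyspark.ml.clustering import BisectingKMeans

# Assumes an active SparkSession bound to `spark`, as in the reproduction above.
data = spark.createDataFrame([
    Row(test_features=DenseVector([43.0, 0.0, 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])),
    Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
    Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
    Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
    Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0])),
])

kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
model = kmeans.fit(data)  # no IllegalArgumentException with the ml vector type
{code}
Even if that is the root cause, the error message itself still seems worth fixing: it renders the mllib and ml vector types identically, so the requirement failure cannot be diagnosed from the message alone.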

See my [StackOverflow question|https://stackoverflow.com/questions/60884142/pyspark-py4j-illegalargumentexception-with-spark-createdataframe-and-pyspark-ml] for more context.



