Posted to user@spark.apache.org by abby <as...@doximity.com> on 2016/10/08 21:04:18 UTC

Apply UDF to SparseVector column in spark 2.0

Hi,

I am trying to apply a UDF to a column in a PySpark df containing
SparseVectors (created using pyspark.ml.feature.IDF). Originally, I was
trying to apply a more involved function, but I get the same error no
matter what function I apply. So for the sake of an example:

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

udfSum = udf(lambda x: np.sum(x.values), FloatType())
df = df.withColumn("vec_sum", udfSum(df.idf))
df.take(10)

I am getting this error:

Py4JJavaError: An error occurred while calling
z:org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 55.0 failed 4 times, most recent failure: Lost task 0.3 in stage
55.0 (TID 111, 10.0.11.102): net.razorvine.pickle.PickleException: expected
zero arguments for construction of ClassDict (for numpy.dtype)

If I convert the df to Pandas and apply the function, I can confirm that
FloatType() is the correct response type. This seems related:
https://issues.apache.org/jira/browse/SPARK-7902, but I'm not sure how to
proceed. Any ideas?
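
[Editor's note, not part of the original message: the ClassDict error above
typically means the UDF returned a numpy scalar (numpy.float64 from np.sum)
rather than a builtin Python float, and Pyrolite cannot unpickle numpy types
on the JVM side. A likely workaround, sketched here with plain numpy and not
confirmed by this thread, is to cast the result before returning it:]

```python
import numpy as np

# Stand-in for SparseVector.values inside the UDF.
values = np.array([0.5, 1.5, 2.0])

s = np.sum(values)
print(type(s))        # numpy.float64 -- this is what trips up Pyrolite

# Casting to a builtin float before the UDF returns avoids the
# "expected zero arguments for construction of ClassDict" error:
s_py = float(np.sum(values))
print(type(s_py))     # plain Python float

# The UDF from the question would then read (hypothetical fix):
# udfSum = udf(lambda x: float(np.sum(x.values)), FloatType())
```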

Thanks!
Abby



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apply-UDF-to-SparseVector-column-in-spark-2-0-tp27870.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org