Posted to issues@spark.apache.org by "Franklyn Dsouza (JIRA)" <ji...@apache.org> on 2017/03/07 00:53:33 UTC
[jira] [Created] (SPARK-19844) UDF in when control function is
executed before the when clause is evaluated.
Franklyn Dsouza created SPARK-19844:
---------------------------------------
Summary: UDF in when control function is executed before the when clause is evaluated.
Key: SPARK-19844
URL: https://issues.apache.org/jira/browse/SPARK-19844
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 2.1.0, 2.0.1
Reporter: Franklyn Dsouza
Sometimes we try to guard the argument passed to a UDF using {code}when(clause, udf).otherwise(default){code}
but we've noticed that the UDF is sometimes run on data that shouldn't have matched the clause.
Here's some code to reproduce the issue.
{code}
from pyspark.sql import functions as F
from pyspark.sql import types
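# `spark` is assumed to be an active SparkSession.
# Single row whose only column, 'r', is null.
df = spark.createDataFrame([{'r': None}], schema=types.StructType([types.StructField('r', types.StringType())]))
# The UDF is not null-safe: it raises AttributeError if it is called with None.
simple_udf = F.udf(lambda ref: ref.strip("/"), types.StringType())
# The when() clause should keep the UDF away from the null row.
df.withColumn(
    'test',
    F.when(F.col("r").isNotNull(), simple_udf(F.col("r")))
     .otherwise(F.lit(None))
).collect()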
{code}
This causes an exception because the UDF is run on null data: AttributeError: 'NoneType' object has no attribute 'strip'.
So it looks like the UDF is being evaluated before the clause in the when is evaluated. Oddly enough, when I change {code}F.col("r").isNotNull(){code} to {code}df["r"] != None{code}, it works.
This might be related to https://issues.apache.org/jira/browse/SPARK-13773
and https://issues.apache.org/jira/browse/SPARK-15282
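
Until the evaluation order is sorted out, one possible workaround is to make the UDF itself tolerate null input instead of relying on when() to shield it. The following is only a sketch of that idea, not a fix for the underlying behaviour; the names null_safe and strip_slashes are hypothetical and do not come from Spark.

```python
import functools


def null_safe(func):
    """Wrap a one-argument function so that a None input returns None.

    Hypothetical helper: it moves the null check into the UDF body,
    so the result no longer depends on when() being evaluated first.
    """
    @functools.wraps(func)
    def wrapper(value):
        if value is None:
            return None
        return func(value)
    return wrapper


@null_safe
def strip_slashes(ref):
    # Same logic as the lambda in the reproduction above.
    return ref.strip("/")


# Registering it would follow the same pattern as the reproduction:
#   simple_udf = F.udf(strip_slashes, types.StringType())
```

With this wrapper the null row should simply come back as null whether or not the UDF is invoked on it, at the cost of repeating the null check that the when() clause was meant to express.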