Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2016/11/03 16:41:58 UTC
[jira] [Issue Comment Deleted] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Herman van Hovell updated SPARK-18254:
--------------------------------------
Comment: was deleted
(was: How can you take the length of a row? Do you expect these to be concatenated?
(BTW: we have a length(...) udf in spark))
> UDFs don't see aliased column names
> -----------------------------------
>
> Key: SPARK-18254
> URL: https://issues.apache.org/jira/browse/SPARK-18254
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Environment: Python 3.5, Java 8
> Reporter: Nicholas Chammas
>
> Dunno if I'm misinterpreting something here, but this seems like a bug in how UDFs work, or in how they interface with the optimizer.
> Here's a basic reproduction:
> {code}
> import pyspark
> from pyspark.sql import Row
> from pyspark.sql.functions import udf, col, struct
>
>
> def length(full_name):
>     # The non-aliased names, FIRST and LAST, show up here, instead of
>     # first_name and last_name.
>     # print(full_name)
>     return len(full_name.first_name) + len(full_name.last_name)
>
>
> if __name__ == '__main__':
>     spark = (
>         pyspark.sql.SparkSession.builder
>         .getOrCreate())
>     length_udf = udf(length)
>     names = spark.createDataFrame([
>         Row(FIRST='Nick', LAST='Chammas'),
>         Row(FIRST='Walter', LAST='Williams'),
>     ])
>     names_cleaned = (
>         names
>         .select(
>             col('FIRST').alias('first_name'),
>             col('LAST').alias('last_name'),
>         )
>         .withColumn('full_name', struct('first_name', 'last_name'))
>         .select('full_name'))
>     # We see the schema we expect here.
>     names_cleaned.printSchema()
>     # However, here we get an AttributeError. length_udf() cannot
>     # find first_name or last_name.
>     (names_cleaned
>         .withColumn('length', length_udf('full_name'))
>         .show())
> {code}
> When I run this I get a long stack trace, but the relevant portions seem to be:
> {code}
>   File ".../udf-alias.py", line 10, in length
>     return len(full_name.first_name) + len(full_name.last_name)
>   File "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", line 1502, in __getattr__
>     raise AttributeError(item)
> AttributeError: first_name
> {code}
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", line 1497, in __getattr__
>     idx = self.__fields__.index(item)
> ValueError: 'first_name' is not in list
> {code}
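The two tracebacks above suggest that attribute access on the struct row is a name-to-index lookup in a list of field names, with a failed lookup surfacing as an AttributeError. The following is a minimal sketch of that behaviour in plain Python. RowSketch is a hypothetical stand-in written for illustration, not pyspark's actual Row class; only the two lines quoted in the traceback are taken from the report.

```python
class RowSketch(tuple):
    """Hypothetical stand-in for a struct row; NOT pyspark's Row class."""

    def __new__(cls, fields, values):
        row = super().__new__(cls, values)
        # Field names are stored next to the positional values.
        row.__fields__ = list(fields)
        return row

    def __getattr__(self, item):
        # Only called when normal attribute lookup fails.
        if item.startswith("__"):
            raise AttributeError(item)
        try:
            # Same lookup as the line quoted in the traceback.
            idx = self.__fields__.index(item)
            return self[idx]
        except ValueError:
            raise AttributeError(item)


# The row carries the original names, as reported for the UDF input.
row = RowSketch(["FIRST", "LAST"], ["Nick", "Chammas"])
print(row.FIRST)

try:
    row.first_name  # the alias is not among the field names
except AttributeError as exc:
    print("AttributeError:", exc)
```

If the row handed to the UDF only knows the names FIRST and LAST, a lookup of first_name fails in exactly this way, which matches the ValueError followed by AttributeError shown above.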
> Here are the relevant query plans:
> {code}
> names_cleaned.explain()
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS full_name#10]
> +- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> {code}
> (names_cleaned
>     .withColumn('length', length_udf('full_name'))
>     .explain())
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS full_name#10, pythonUDF0#21 AS length#17]
> +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, pythonUDF0#21]
>    +- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
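In the second plan, BatchEvalPython is given struct(FIRST#0, LAST#1), so the aliases are absent from the UDF's input. One possible workaround, which is not proposed anywhere in this issue and has not been verified against Spark 2.0.1 here, is to read the struct fields by position rather than by name. The sketch below assumes the struct reaches the UDF as a tuple-like row with the fields in declaration order; length_by_position is a hypothetical name.

```python
def length_by_position(full_name):
    # Assumption: full_name is tuple-like, ordered (first name, last name).
    # Positional access does not depend on which field names the row carries.
    first, last = full_name[0], full_name[1]
    return len(first) + len(last)


# Plain tuples stand in for the struct rows from the reproduction.
print(length_by_position(("Nick", "Chammas")))     # 11
print(length_by_position(("Walter", "Williams")))  # 14
```

In the reproduction this function would be wrapped with udf(...) in place of length. The trade-off is that positional access silently depends on column order, so it is a stopgap rather than a fix for the naming mismatch.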