Posted to issues@spark.apache.org by "Nicholas Chammas (JIRA)" <ji...@apache.org> on 2016/11/03 15:49:58 UTC

[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names; somehow they get the original names

    [ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633220#comment-15633220 ] 

Nicholas Chammas commented on SPARK-18254:
------------------------------------------

[~marmbrus] / [~hvanhovell]: Is there a workaround for this issue?

> UDFs don't see aliased column names; somehow they get the original names
> ------------------------------------------------------------------------
>
>                 Key: SPARK-18254
>                 URL: https://issues.apache.org/jira/browse/SPARK-18254
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>         Environment: Python 3.5, Java 8
>            Reporter: Nicholas Chammas
>
> Dunno if I'm misinterpreting something here, but this seems like a bug in how UDFs work, or in how they interface with the optimizer.
> Here's a basic reproduction:
> {code}
> import pyspark
> from pyspark.sql import Row
> from pyspark.sql.functions import udf, col, struct
>
>
> def length(full_name):
>     # The non-aliased names, FIRST and LAST, show up here, instead of
>     # first_name and last_name.
>     # print(full_name)
>     return len(full_name.first_name) + len(full_name.last_name)
>
>
> if __name__ == '__main__':
>     spark = (
>         pyspark.sql.SparkSession.builder
>         .getOrCreate())
>     length_udf = udf(length)
>     names = spark.createDataFrame([
>         Row(FIRST='Nick', LAST='Chammas'),
>         Row(FIRST='Walter', LAST='Williams'),
>     ])
>     names_cleaned = (
>         names
>         .select(
>             col('FIRST').alias('first_name'),
>             col('LAST').alias('last_name'),
>         )
>         .withColumn('full_name', struct('first_name', 'last_name'))
>         .select('full_name'))
>     # We see the schema we expect here.
>     names_cleaned.printSchema()
>     # However, here we get an AttributeError. length_udf() cannot
>     # find first_name or last_name.
>     (names_cleaned
>     .withColumn('length', length_udf('full_name'))
>     .show())
> {code}
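
One possible stopgap, sketched here and not confirmed anywhere in this thread, is to read the struct fields by position instead of by name. A PySpark Row is a tuple subclass, so positional indexing does not depend on which field names the UDF ends up seeing. The names length_by_position and build_length_column are hypothetical, and the explicit IntegerType return type is an addition (the reproduction above uses the default). This has not been verified against 2.0.1.

```python
def length_by_position(full_name):
    # Index the struct by position rather than by attribute name, so the
    # result is the same whether the fields arrive as FIRST/LAST or as
    # first_name/last_name. Assumes the struct holds exactly two strings.
    first, last = full_name[0], full_name[1]
    return len(first) + len(last)


def build_length_column(names_cleaned):
    # Hypothetical helper: wraps the function above as a UDF and applies it
    # to the 'full_name' struct column from the reproduction.
    # pyspark is imported lazily so the plain function can be tried alone.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    length_udf = udf(length_by_position, IntegerType())
    return names_cleaned.withColumn('length', length_udf('full_name'))
```

With the names_cleaned DataFrame from the reproduction, build_length_column(names_cleaned).show() would be the equivalent of the failing call. The trade-off is that positional access silently depends on the field order passed to struct().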


