Posted to issues@spark.apache.org by "Nicholas Chammas (JIRA)" <ji...@apache.org> on 2016/11/03 15:35:58 UTC

[jira] [Created] (SPARK-18254) UDFs don't see aliased column names; somehow they get the original names

Nicholas Chammas created SPARK-18254:
----------------------------------------

             Summary: UDFs don't see aliased column names; somehow they get the original names
                 Key: SPARK-18254
                 URL: https://issues.apache.org/jira/browse/SPARK-18254
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.1
         Environment: Python 3.5, Java 8
            Reporter: Nicholas Chammas


I may be misinterpreting something here, but this looks like a bug in how UDFs work, or in how they interact with the optimizer.

Here's a basic reproduction:

{code}
import pyspark
from pyspark.sql import Row
from pyspark.sql.functions import udf, col, struct


def length(full_name):
    # full_name arrives as a Row whose fields carry the non-aliased names,
    # FIRST and LAST, instead of first_name and last_name.
    # Uncomment the next line to see this:
    # print(full_name)
    return len(full_name.first_name) + len(full_name.last_name)


if __name__ == '__main__':
    spark = (
        pyspark.sql.SparkSession.builder
        .getOrCreate())

    length_udf = udf(length)

    names = spark.createDataFrame([
        Row(FIRST='Nick', LAST='Chammas'),
        Row(FIRST='Walter', LAST='Williams'),
    ])

    names_cleaned = (
        names
        .select(
            col('FIRST').alias('first_name'),
            col('LAST').alias('last_name'),
        )
        .withColumn('full_name', struct('first_name', 'last_name'))
        .select('full_name'))

    # We see the schema we expect here.
    names_cleaned.printSchema()

    # However, here we get an AttributeError. length_udf() cannot
    # find first_name or last_name.
    (names_cleaned
        .withColumn('length', length_udf('full_name'))
        .show())
{code}
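
For illustration only: pyspark.sql.Row is a tuple subclass, so positional access does not depend on which field names the struct carries. The sketch below runs without Spark and uses a namedtuple as a stand-in for the Row that the UDF reportedly receives. The stand-in FullName and the helper length_by_position are hypothetical names, and this has not been verified as a workaround on Spark 2.0.1.

```python
from collections import namedtuple

# Hypothetical stand-in for the Row handed to the UDF: the fields carry the
# original names (FIRST, LAST) rather than the aliases.
FullName = namedtuple('FullName', ['FIRST', 'LAST'])


def length_by_position(full_name):
    # Index by position instead of by attribute name, so the result does
    # not depend on which field names the struct arrives with.
    return len(full_name[0]) + len(full_name[1])


if __name__ == '__main__':
    nick = FullName('Nick', 'Chammas')
    # Attribute access by alias fails on the stand-in, mirroring the report.
    print(hasattr(nick, 'first_name'))
    print(length_by_position(nick))
```

If positional access behaves the same way inside a real UDF, length_by_position could be wrapped with udf() in place of length. That is an assumption to check, and it would only sidestep the naming problem, not fix it.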



