Posted to issues@spark.apache.org by "Daniel Solow (Jira)" <ji...@apache.org> on 2021/03/18 21:56:00 UTC

[jira] [Created] (SPARK-34794) Nested higher-order functions broken in DSL

Daniel Solow created SPARK-34794:
------------------------------------

             Summary: Nested higher-order functions broken in DSL
                 Key: SPARK-34794
                 URL: https://issues.apache.org/jira/browse/SPARK-34794
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.1
         Environment: 3.1.1
            Reporter: Daniel Solow


In Spark 3, if I have:

{code:java}
val df = Seq(
    (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")
{code}
and I want to take the cross product of these two arrays, I can do the following in SQL:

{code:java}
df.selectExpr("""
    FLATTEN(
        TRANSFORM(
            numbers,
            number -> TRANSFORM(
                letters,
                letter -> (number AS number, letter AS letter)
            )
        )
    ) AS zipped
""").show(false)
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
{code}

This works fine. But when I try the equivalent using the Scala DSL, the result is wrong:

{code:java}
df.select(
    f.flatten(
        f.transform(
            $"numbers",
            (number: Column) => { f.transform(
                $"letters",
                (letter: Column) => { f.struct(
                    number.as("number"),
                    letter.as("letter")
                ) }
            ) }
        )
    ).as("zipped")
).show(10, false)
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
{code}

Note that the numbers are not included in the output; each struct contains the letter twice. The explain output for this second version is:

{code:java}
== Parsed Logical Plan ==
'Project [flatten(transform('numbers, lambdafunction(transform('letters, lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, false))) AS zipped#444]
+- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
   +- LocalRelation [_1#303, _2#304]

== Analyzed Logical Plan ==
zipped: array<struct<number:string,letter:string>>
Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda x#446, false)), lambda x#445, false))) AS zipped#444]
+- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
   +- LocalRelation [_1#303, _2#304]

== Optimized Logical Plan ==
LocalRelation [zipped#444]

== Physical Plan ==
LocalTableScan [zipped#444]
{code}

It seems the lambda variable name x is hardcoded. And sure enough: https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647
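
Until this is fixed, one possible workaround (a sketch, not verified on every Spark version) is to route the nested higher-order functions through the SQL parser with f.expr, since the SQL path assigns distinct lambda variables and produces the correct result above:

{code:java}
import org.apache.spark.sql.{functions => f}

// Workaround sketch: build the nested TRANSFORM as a SQL expression so the
// parser, not the DSL, names the lambda variables. Assumes the same df as
// above: Seq((Seq(1,2,3), Seq("a","b","c"))).toDF("numbers", "letters").
df.select(f.expr("""
  FLATTEN(
    TRANSFORM(numbers, number ->
      TRANSFORM(letters, letter ->
        (number AS number, letter AS letter))))
""").as("zipped")).show(false)
{code}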


--
This message was sent by Atlassian Jira
(v8.3.4#803005)