You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Daniel Solow (Jira)" <ji...@apache.org> on 2021/03/22 23:53:00 UTC

[jira] [Updated] (SPARK-34830) Some UDF calls inside transform are broken

     [ https://issues.apache.org/jira/browse/SPARK-34830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Solow updated SPARK-34830:
---------------------------------
    Description: 
Let's say I want to create a UDF to do a simple lookup on a string:

{code:java}
import org.apache.spark.sql.{functions => f}
val M = Map("a" -> "abc", "b" -> "defg")
val BM = spark.sparkContext.broadcast(M)
val LOOKUP = f.udf((s: String) => BM.value.get(s))
{code}

Now if I have the following dataframe:

{code:java}
val df = Seq(
    Tuple1(Seq("a", "b"))
).toDF("arr")
{code}

and I want to run this UDF over each element in the array, I can do:

{code:java}
df.select(f.transform($"arr", i => LOOKUP(i)).as("arr")).show(false)
{code}

This should show:

{code:java}
+-----------+
|arr        |
+-----------+
|[abc, defg]|
+-----------+
{code}
However it actually shows:

{code:java}
+-----------+
|arr        |
+-----------+
|[def, defg]|
+-----------+
{code}

It's also broken for SQL (even without DSL). This gives the same result:

{code:java}
spark.udf.register("LOOKUP",(s: String) => BM.value.get(s))
df.selectExpr("TRANSFORM(arr, a -> LOOKUP(a)) AS arr").show(false)
{code}


Note that "def" is not even in the map I'm using.

This is a big problem because it breaks existing code/UDFs. I noticed this because the job I ported from 2.4.5 to 3.1.1 seemed to be working, but was actually producing broken data.

  was:
Let's say I want to create a UDF to do a simple lookup on a string:

{code:java}
import org.apache.spark.sql.{functions => f}
val M = Map("a" -> "abc", "b" -> "defg")
val BM = spark.sparkContext.broadcast(M)
val LOOKUP = f.udf((s: String) => BM.value.get(s))
{code}

Now if I have the following dataframe:

{code:java}
val df = Seq(
    Tuple1(Seq("a", "b"))
).toDF("arr")
{code}

and I want to run this UDF over each element in the array, I can do:

{code:java}
df.select(f.transform($"arr", i => LOOKUP(i)).as("arr")).show(false)
{code}

This should show:

{code:java}
+-----------+
|arr        |
+-----------+
|[abc, defg]|
+-----------+
{code}
However it actually shows:

{code:java}
+-----------+
|arr        |
+-----------+
|[def, defg]|
+-----------+
{code}

Note that "def" is not even in the map I'm using.

This is a big problem because it breaks existing code/UDFs. I noticed this because the job I ported from 2.4.5 to 3.1.1 seemed to be working, but was actually producing broken data.


> Some UDF calls inside transform are broken
> ------------------------------------------
>
>                 Key: SPARK-34830
>                 URL: https://issues.apache.org/jira/browse/SPARK-34830
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Daniel Solow
>            Priority: Major
>
> Let's say I want to create a UDF to do a simple lookup on a string:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> val M = Map("a" -> "abc", "b" -> "defg")
> val BM = spark.sparkContext.broadcast(M)
> val LOOKUP = f.udf((s: String) => BM.value.get(s))
> {code}
> Now if I have the following dataframe:
> {code:java}
> val df = Seq(
>     Tuple1(Seq("a", "b"))
> ).toDF("arr")
> {code}
> and I want to run this UDF over each element in the array, I can do:
> {code:java}
> df.select(f.transform($"arr", i => LOOKUP(i)).as("arr")).show(false)
> {code}
> This should show:
> {code:java}
> +-----------+
> |arr        |
> +-----------+
> |[abc, defg]|
> +-----------+
> {code}
> However it actually shows:
> {code:java}
> +-----------+
> |arr        |
> +-----------+
> |[def, defg]|
> +-----------+
> {code}
> It's also broken for SQL (even without DSL). This gives the same result:
> {code:java}
> spark.udf.register("LOOKUP",(s: String) => BM.value.get(s))
> df.selectExpr("TRANSFORM(arr, a -> LOOKUP(a)) AS arr").show(false)
> {code}
> Note that "def" is not even in the map I'm using.
> This is a big problem because it breaks existing code/UDFs. I noticed this because the job I ported from 2.4.5 to 3.1.1 seemed to be working, but was actually producing broken data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org