Posted to issues@spark.apache.org by "Daniel Solow (Jira)" <ji...@apache.org> on 2021/03/22 23:53:00 UTC
[jira] [Updated] (SPARK-34830) Some UDF calls inside transform are broken
[ https://issues.apache.org/jira/browse/SPARK-34830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Solow updated SPARK-34830:
---------------------------------
Description:
Let's say I want to create a UDF to do a simple lookup on a string:
{code:java}
import org.apache.spark.sql.{functions => f}

// assumes an active SparkSession named `spark`
val M = Map("a" -> "abc", "b" -> "defg")
val BM = spark.sparkContext.broadcast(M)
val LOOKUP = f.udf((s: String) => BM.value.get(s))
{code}
Now if I have the following dataframe:
{code:java}
import spark.implicits._  // for .toDF and the $"col" syntax

val df = Seq(
  Tuple1(Seq("a", "b"))
).toDF("arr")
{code}
and I want to run this UDF over each element in the array, I can do:
{code:java}
df.select(f.transform($"arr", i => LOOKUP(i)).as("arr")).show(false)
{code}
This should show:
{code:java}
+-----------+
|arr |
+-----------+
|[abc, defg]|
+-----------+
{code}
However it actually shows:
{code:java}
+-----------+
|arr |
+-----------+
|[def, defg]|
+-----------+
{code}
It's also broken in SQL (even without the DataFrame DSL). The following gives the same result:
{code:java}
spark.udf.register("LOOKUP", (s: String) => BM.value.get(s))
df.selectExpr("TRANSFORM(arr, a -> LOOKUP(a)) AS arr").show(false)
{code}
Note that "def" is not even in the map I'm using.
This is a big problem because it breaks existing code/UDFs. I noticed this because the job I ported from 2.4.5 to 3.1.1 seemed to be working, but was actually producing broken data.
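As a temporary workaround (my own assumption, not a confirmed fix), the per-element UDF call inside {{transform}} can be avoided by having the UDF take the whole array and map over it in plain Scala, so the higher-order-function code path is never involved:

```scala
import org.apache.spark.sql.{functions => f}

// Same broadcast map as above; assumes an active SparkSession named `spark`.
val M = Map("a" -> "abc", "b" -> "defg")
val BM = spark.sparkContext.broadcast(M)

// One UDF call per row over the whole array, instead of one call per
// element inside transform().
val LOOKUP_ALL = f.udf((arr: Seq[String]) => arr.map(s => BM.value.get(s)))

import spark.implicits._
val df = Seq(Tuple1(Seq("a", "b"))).toDF("arr")
df.select(LOOKUP_ALL($"arr").as("arr")).show(false)
```

This sidesteps the bug at the cost of moving the iteration out of Catalyst and into the UDF.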
> Some UDF calls inside transform are broken
> ------------------------------------------
>
> Key: SPARK-34830
> URL: https://issues.apache.org/jira/browse/SPARK-34830
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Daniel Solow
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)