Posted to issues@spark.apache.org by "Guy Boo (Jira)" <ji...@apache.org> on 2022/11/08 11:26:00 UTC

[jira] [Created] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions

Guy Boo created SPARK-41049:
-------------------------------

             Summary: Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions
                 Key: SPARK-41049
                 URL: https://issues.apache.org/jira/browse/SPARK-41049
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.2
            Reporter: Guy Boo


h2. Expectation

For a given row, Nondeterministic expressions should have stable values.
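{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._ // IntegerType, used in the casts below
// Assumes a spark-shell session, where `spark` and its implicits are in scope.
val df = spark.sparkContext.parallelize(1 to 5).toDF("x")
val v1 = rand().*(lit(10000))
df.select(v1, v1).collect
{code}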
This should return rows in which both columns always hold the same value, while that value differs from row to row. The same holds for composed expressions:
{code:scala}
df.select(v1.cast(IntegerType), v1.cast(IntegerType)).collect
{code}
Both columns should still have the same value. This is different from the following:
{code:scala}
df.select(rand(), rand()).collect{code}
This should always produce different values in each column, because the two rand() calls refer to different invocations.
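The expectation above could be checked mechanically with something like the following. This is only a minimal sketch: countMismatches is a hypothetical helper (not a Spark API), and it assumes a SparkSession can be obtained with getOrCreate(), as in spark-shell.
{code:scala}
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical helper (not part of Spark): selects the same Column twice and
// counts the rows where the two results differ. The comparison runs on the driver.
def countMismatches(df: DataFrame, c: Column): Int =
  df.select(c.as("left"), c.as("right"))
    .collect()
    .count(row => row.get(0) != row.get(1))

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._ // required for toDF

val df = spark.sparkContext.parallelize(1 to 5).toDF("x")
val v1 = rand() * lit(10000)

// Expected to be 0 if the expectation described above holds.
println(countMismatches(df, v1))
{code}
The same helper can be applied to any other Column built on top of v1 to see whether that composition keeps the value stable.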
h2. Problem

This expectation does not appear to hold when the nondeterministic expression is a child of a CodegenFallback expression. This program:
{code:scala}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._ // required for toDF
val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
val v1 = rand().*(lit(10000)).cast(IntegerType)
val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
df.select(v1, v1, v2, v2).collect {code}
produces output like this (columns are v1, v1, v2, v2, in select order):
|8159|8159|8159|{color:#FF0000}2028{color}|
|8320|8320|8320|{color:#FF0000}1640{color}|
|7937|7937|7937|{color:#FF0000}769{color}|
|436|436|436|{color:#FF0000}8924{color}|
|8924|8924|2827|{color:#FF0000}2731{color}|

It is not clear why the first evaluation via the CodegenFallback path should be correct while subsequent evaluations are not.
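One possible way to sidestep the problem, offered as an untested sketch and not as a confirmed fix, is to evaluate the nondeterministic expression once into its own column and have later expressions reference that column. This assumes Spark keeps the two projections separate, so that to_csv reads an already-computed value. The names withA, a and csv are illustrative only.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._ // required for toDF

val df = spark.sparkContext.parallelize(1 to 5).toDF("x")

// Evaluate the nondeterministic expression once, as its own column "a".
val withA = df.withColumn("a", (rand() * lit(10000)).cast(IntegerType))

// Later expressions reference the column "a" instead of the rand() expression.
val a = col("a")
val csv = to_csv(struct(a)) // to_csv is still a CodegenFallback expression
val rows = withA.select(a, a, csv, csv).collect()
rows.foreach(println)
{code}
If the assumption holds, all four columns should agree within each row; if they do not, that is further evidence for the bug described above.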


