Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/01/01 03:12:00 UTC

[jira] [Resolved] (SPARK-26510) Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'

     [ https://issues.apache.org/jira/browse/SPARK-26510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-26510.
----------------------------------
    Resolution: Cannot Reproduce

This is fixed in the current master. It would be great if the JIRA that fixed this problem could be identified and, if applicable, backported.

> Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26510
>                 URL: https://issues.apache.org/jira/browse/SPARK-26510
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.0
>            Reporter: Hagai Attias
>            Priority: Major
>
> It seems that there is a change of behaviour between 1.6 and 2.3 when caching a DataFrame and registering it as a temp table. In 1.6, the following code executed {{printUDF}} once. The equivalent code in 2.3 (or even the 1.6 code run as is) executes it twice.
>  
> {code:java|title=RegisterTest_spark1.6.scala|borderStyle=solid}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
>
> val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = sql.createDataFrame(rdd, schema)
> df1.registerTempTable("data_table")
> sql.udf.register("printUDF", (x: Int) => {
>   print(x)
>   x
> })
> val df2 = sql.sql("select printUDF(num) result from data_table").cache()
> df2.collect() // materialize the cache
> df2.registerTempTable("cached_table")
> val df3 = sql.sql("select result from cached_table")
> df3.collect()
> {code}
> {code:java|title=RegisterTest_spark2.3.scala|borderStyle=solid}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
>
> val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = session.createDataFrame(rdd, schema)
> df1.createOrReplaceTempView("data_table")
> session.udf.register("printUDF", (x: Int) => {
>   print(x)
>   x
> })
> val df2 = session.sql("select printUDF(num) result from data_table").cache()
> df2.collect() // materialize the cache
> df2.createOrReplaceTempView("cached_table")
> val df3 = session.sql("select result from cached_table")
> df3.collect()
> {code}
>  
> 1.6 prints {{123}} while 2.3 prints {{123123}}, thus evaluating the DataFrame twice.
> I managed to mitigate this by skipping the temporary table and selecting directly from the cached DataFrame, but I was wondering whether that is expected behaviour or a known issue.
>  
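For reference, the mitigation the reporter describes (selecting from the cached DataFrame directly instead of registering a second temp view) can be sketched as a self-contained job. The accumulator, object name, and local master are illustrative assumptions for the sketch, not part of the original report or the committed fix.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object Spark26510Workaround {
  // Returns how many times the UDF ran across both collects.
  def run(): Long = {
    val session = SparkSession.builder()
      .master("local[1]")            // assumption: run locally for the sketch
      .appName("SPARK-26510-workaround")
      .getOrCreate()
    try {
      // Count UDF evaluations with an accumulator instead of print().
      val acc = session.sparkContext.longAccumulator("udfEvals")
      val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
      val schema = StructType(StructField("num", IntegerType) :: Nil)
      session.createDataFrame(rdd, schema).createOrReplaceTempView("data_table")
      session.udf.register("printUDF", (x: Int) => { acc.add(1); x })

      val df2 = session.sql("select printUDF(num) result from data_table").cache()
      df2.collect()                  // materializes the cache; the UDF runs once per row

      // Workaround: select from df2 itself rather than a second temp view,
      // so the cached plan is reused and the UDF is not re-evaluated.
      val df3 = df2.select("result")
      df3.collect()

      acc.value                      // 3 if the cache was reused, 6 if re-evaluated
    } finally {
      session.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```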



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org