Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/04/28 10:57:00 UTC

[jira] [Resolved] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python

     [ https://issues.apache.org/jira/browse/SPARK-38870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-38870.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 36161
[https://github.com/apache/spark/pull/36161]

> SparkSession.builder returns a new builder in Scala, but not in Python
> ----------------------------------------------------------------------
>
>                 Key: SPARK-38870
>                 URL: https://issues.apache.org/jira/browse/SPARK-38870
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.2.1
>            Reporter: Furcy Pin
>            Assignee: Furcy Pin
>            Priority: Major
>             Fix For: 3.4.0
>
>
> In PySpark, _SparkSession.builder_ always returns the same static builder, whereas it should behave as in Scala, where each access returns a new builder.
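> The identity difference can be observed directly. A minimal check, assuming an affected version such as 3.2.1:
> {code:python}
> from pyspark.sql import SparkSession
>
> b1 = SparkSession.builder
> b2 = SparkSession.builder
> print(b1 is b2)  # True in affected PySpark versions: every access returns the same shared Builder
> {code}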
> *How to reproduce*
> When we run the following code in Scala:
> {code:java}
> import org.apache.spark.sql.SparkSession
> val s1 = SparkSession.builder.master("local[2]").config("key", "value").getOrCreate()
> println("A : " + s1.conf.get("key")) // value
> s1.conf.set("key", "new_value")
> println("B : " + s1.conf.get("key")) // new_value
> val s2 = SparkSession.builder.getOrCreate()
> println("C : " + s1.conf.get("key")) // new_value{code}
> The output is:
> {code:java}
> A : value
> B : new_value
> C : new_value   <<<<<<<<<<<{code}
>  
> But when we run the following (supposedly equivalent) code in Python:
> {code:java}
> from pyspark.sql import SparkSession
> s1 = SparkSession.builder.master("local[2]").config("key", "value").getOrCreate()
> print("A : " + s1.conf.get("key"))
> s1.conf.set("key", "new_value")
> print("B : " + s1.conf.get("key"))
> s2 = SparkSession.builder.getOrCreate()
> print("C : " + s1.conf.get("key")){code}
> The output is:
> {code:java}
> A : value
> B : new_value
> C : value  <<<<<<<<<<<
> {code}
>  
> *Root cause analysis*
> This comes from the fact that _SparkSession.builder_ behaves differently in Python than in Scala. In Scala it returns a *new builder* each time; in Python it returns *the same builder* every time, and the _options dict on SparkSession.Builder is static (class-level), too.
> Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the options that were passed to the very first builder are re-applied every time, overriding any options that were set afterwards.
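> A simplified, self-contained sketch of the mechanism (hypothetical stand-in classes, not the actual PySpark source):
> {code:python}
> _active_session = None  # stands in for PySpark's internal active-session registry
>
> class Conf:
>     def __init__(self):
>         self._settings = {}
>     def set(self, key, value):
>         self._settings[key] = value
>     def get(self, key):
>         return self._settings[key]
>
> class Session:
>     def __init__(self):
>         self.conf = Conf()
>
> class Builder:
>     _options = {}  # class-level dict: shared by all builders and never reset
>
>     def config(self, key, value):
>         self._options[key] = value  # writes into the shared class-level dict
>         return self
>
>     def getOrCreate(self):
>         global _active_session
>         if _active_session is None:
>             _active_session = Session()
>         for key, value in self._options.items():
>             _active_session.conf.set(key, value)  # re-applies the first options on every call
>         return _active_session
>
> s1 = Builder().config("key", "value").getOrCreate()
> s1.conf.set("key", "new_value")
> Builder().getOrCreate()        # re-applies {"key": "value"} to the existing session
> print(s1.conf.get("key"))      # value, not new_value -- the "C : value" output above
> {code}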
> This leads to very awkward behavior in every Spark version up to and including 3.2.1.
> *Example:*
> The example below used to fail, but was fixed by SPARK-37638:
>  
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", "DYNAMIC").getOrCreate()
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "DYNAMIC" # OK
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # OK
> from pyspark.sql import functions as f
> from pyspark.sql.types import StringType
> f.col("a").cast(StringType()) 
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # This fails in all versions until the SPARK-37638 fix
> # because before that fix, Column.cast() called SparkSession.builder.getOrCreate(){code}
>  
> But the following example still fails on the current master branch:
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", "DYNAMIC").getOrCreate()
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "DYNAMIC" # OK
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # OK
> SparkSession.builder.getOrCreate() 
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # This assert fails in master{code}
>  
> I made a pull request to fix this bug: https://github.com/apache/spark/pull/36161
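> For reference, the general shape of the fix is to make _builder_ a class-level property that returns a fresh _Builder_ with per-instance options on each access, as in Scala. A minimal sketch under that assumption (not the actual patch; see the PR above for the real change):
> {code:python}
> class classproperty(property):
>     # Like @property, but computed on the class itself rather than on instances.
>     def __get__(self, instance, owner=None):
>         return self.fget(owner)
>
> class SparkSession:
>     class Builder:
>         def __init__(self):
>             self._options = {}  # per-instance dict: options no longer leak between builders
>
>     @classproperty
>     def builder(cls):
>         return cls.Builder()  # a new Builder on every access, matching the Scala behaviour
> {code}
> With this shape, each access to SparkSession.builder starts from an empty options dict, so a bare getOrCreate() elsewhere in the code base can no longer re-apply the first builder's options.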



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
