Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2022/12/04 23:37:00 UTC

[jira] [Created] (SPARK-41379) Inconsistency of spark session in DataFrame in user function for foreachBatch sink in PySpark

Jungtaek Lim created SPARK-41379:
------------------------------------

             Summary: Inconsistency of spark session in DataFrame in user function for foreachBatch sink in PySpark
                 Key: SPARK-41379
                 URL: https://issues.apache.org/jira/browse/SPARK-41379
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Structured Streaming
    Affects Versions: 3.3.2, 3.4.0
            Reporter: Jungtaek Lim


[https://docs.databricks.com/_static/notebooks/merge-in-streaming.html]

According to some manual testing against the above code example in PySpark, the sparkSession attached to the given DataFrame does not appear to be the same as the cloned session used by the streaming query. In other words, {{df.sparkSession}} does not seem to be the same as the cloned Spark session which you can access via {{df._jdf.sparkSession()}}.
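
For reference, here is a minimal sketch of the kind of check that exposes the mismatch inside a foreachBatch user function. It is not taken from the report: the rate source and the function/app names are illustrative stand-ins for the Delta merge example in the linked notebook.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-41379-check").getOrCreate()

def check_sessions(micro_batch_df, batch_id):
    # Session exposed by the Python DataFrame wrapper.
    py_session = micro_batch_df.sparkSession
    # Cloned session carried by the underlying JVM DataFrame.
    jvm_session = micro_batch_df._jdf.sparkSession()
    # On affected versions the two do not refer to the same JVM session object.
    print("same session:", py_session._jsparkSession.equals(jvm_session))

query = (
    spark.readStream.format("rate").load()
    .writeStream
    .foreachBatch(check_sessions)
    .start()
)
query.processAllAvailable()
query.stop()
{code}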

Which session gets picked therefore depends on the internal implementation of each method in the PySpark DataFrame, which users have no way of knowing. If a method picks a different session than expected, it can open a backdoor for bypassing session-scoped restrictions (e.g. AQE settings), make session-scoped resources (e.g. temp views) invisible, etc.
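
As a hedged illustration of the temp view point, building on the sketch above (the view name and query are made up, mirroring the pattern in the linked notebook): a view registered through the JVM DataFrame lands on the cloned session, while {{df.sparkSession}} may resolve to a different session that cannot see it.

{code:python}
def merge_batch(micro_batch_df, batch_id):
    # Registers the temp view through the JVM DataFrame, i.e. on the cloned
    # session that runs the streaming query.
    micro_batch_df.createOrReplaceTempView("updates")

    # Runs against the cloned session, so the view is visible here
    # (this is the pattern used in the linked notebook).
    micro_batch_df._jdf.sparkSession().sql("SELECT count(*) FROM updates")

    # If micro_batch_df.sparkSession points at a different (non-cloned) session,
    # the same lookup through it may fail to find the view:
    # micro_batch_df.sparkSession.sql("SELECT count(*) FROM updates")
{code}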

So it’s quite critical to keep the two sessions in sync so that they refer to the same session.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org