Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/07 05:16:58 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

HyukjinKwon opened a new pull request #35410:
URL: https://github.com/apache/spark/pull/35410


   ### What changes were proposed in this pull request?
   
    This PR proposes to use `SparkSession` within PySpark. This is base work for respecting runtime configuration, etc. Currently, we rely on the old, deprecated `SQLContext` internally.
    
    In addition, this PR also proposes to expose `DataFrame.sparkSession`, as the Scala API does.
    
    Lastly, it contains a bit of refactoring, such as:
    - Moving `SQLContext._conf` to `SparkSession._jconf`.
    - Renaming `rdd_array` to `df_array` in `DataFrame.randomSplit`.
    - Deprecating `DataFrame.sql_ctx` and `DataFrame(..., sql_ctx)` with a warning.
   
   ### Why are the changes needed?
   
    - This is base work for PySpark to respect runtime configuration.
    - To expose the same API layer as the Scala API (`df.sparkSession`).
    - To avoid relying on the old `SQLContext`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes.
   
    - Deprecates `DataFrame.sql_ctx` and `DataFrame(..., sql_ctx)` with a warning.
    - New API: `DataFrame.sparkSession` (a usage sketch follows below).
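    
    A minimal usage sketch of the change (assuming a PySpark build that includes this PR; the warning text is paraphrased from the new property):
    
    ```python
    import warnings
    
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1)
    
    # New API: the session that created this DataFrame.
    print(type(df.sparkSession))  # <class 'pyspark.sql.session.SparkSession'>
    
    # Deprecated: DataFrame.sql_ctx now warns and points to sparkSession.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        _ = df.sql_ctx
    print(any("sparkSession" in str(w.message) for w in caught))  # True
    ```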
   
   ### How was this patch tested?
   
   Existing test cases should cover them.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r806358814



##########
File path: python/pyspark/sql/context.py
##########
@@ -714,14 +711,13 @@ def __init__(self, sparkContext: SparkContext, jhiveContext: Optional[JavaObject
             + "SparkSession.builder.enableHiveSupport().getOrCreate() instead.",
             FutureWarning,
         )
+        static_conf = {}
         if jhiveContext is None:
-            sparkContext._conf.set(  # type: ignore[attr-defined]
-                "spark.sql.catalogImplementation", "hive"
-            )
-            sparkSession = SparkSession.builder._sparkContext(sparkContext).getOrCreate()
-        else:
-            sparkSession = SparkSession(sparkContext, jhiveContext.sparkSession())
-        SQLContext.__init__(self, sparkContext, sparkSession, jhiveContext)
+            static_conf = {"spark.sql.catalogImplementation": "in-memory"}

Review comment:
       Shoot. I pushed a wrong change at the last minute - I was manually testing whether this value is set correctly. It has to be `hive`. This uses an existing session that has `hive` by default in tests, so the tests aren't affected, but it would impact the shell if users use `HiveContext` directly without an existing Spark session.
   
   Sorry this was my mistake. I will make a quick followup.
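    
    As a sketch of the intended fix, here is a hypothetical standalone rendering of the corrected logic (`hive_static_conf` is an illustrative helper, not the actual followup patch):
    
    ```python
    from typing import Dict, Optional
    
    def hive_static_conf(jhiveContext: Optional[object]) -> Dict[str, str]:
        # With no existing jhiveContext, HiveContext must request the Hive
        # catalog; "in-memory" was the value pushed by mistake.
        if jhiveContext is None:
            return {"spark.sql.catalogImplementation": "hive"}
        return {}
    
    assert hive_static_conf(None) == {"spark.sql.catalogImplementation": "hive"}
    ```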






[GitHub] [spark] HyukjinKwon commented on pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35410:
URL: https://github.com/apache/spark/pull/35410#issuecomment-1039723927


   @JoshRosen when you find some time, mind taking a look as a post-review? Just wanted to make sure I haven't missed anything.






[GitHub] [spark] HyukjinKwon commented on pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35410:
URL: https://github.com/apache/spark/pull/35410#issuecomment-1039723593


   Merged to master.




[GitHub] [spark] HyukjinKwon closed pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #35410:
URL: https://github.com/apache/spark/pull/35410


   






[GitHub] [spark] HyukjinKwon commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r810431358



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -116,13 +140,45 @@ def __init__(self, jdf: JavaObject, sql_ctx: "SQLContext"):
         # by __repr__ and _repr_html_ while eager evaluation opened.
         self._support_repr_html = False
 
+    @property
+    def sql_ctx(self) -> "SQLContext":
+        from pyspark.sql.context import SQLContext
+
+        warnings.warn(
+            "DataFrame.sql_ctx is an internal property, and will be removed "
+            "in future releases. Use DataFrame.sparkSession instead."
+        )
+        if self._sql_ctx is None:
+            self._sql_ctx = SQLContext._get_or_create(self._sc)
+        return self._sql_ctx
+
+    @property  # type: ignore[misc]
+    def sparkSession(self) -> "SparkSession":
+        """Returns Spark session that created this :class:`DataFrame`.
+
+        .. versionadded:: 3.3.0
+
+        Examples
+        --------
+        >>> df = spark.range(1)
+        >>> type(df.sparkSession)
+        <class 'pyspark.sql.session.SparkSession'>
+        """
+        from pyspark.sql.session import SparkSession
+
+        if self._session is None:
+            self._session = SparkSession._getActiveSessionOrCreate()

Review comment:
       Yes, let me make a quick followup after double-checking it (see https://github.com/apache/spark/pull/35575).
    
    For a bit of clarification, this code path is triggered when third-party libraries or external users directly create `DataFrame(..., SQLContext)`, so standard public APIs won't be affected. The only affected case is when a user manually creates a `DataFrame(..., SQLContext)` and then calls the new `df.sparkSession` API, which is pretty unlikely and rather minor, especially given that 1. I don't think `SQLContext` users care about `SparkSession`, 2. `SQLContext` was deprecated many years ago, and 3. the `DataFrame(...)` constructor is intended to be internal.
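    
    To make the fallback concrete, a JVM-free stub sketch (the `Fake*` classes are hypothetical stand-ins for the real pyspark objects):
    
    ```python
    from typing import Optional
    
    class FakeSession:
        @staticmethod
        def _getActiveSessionOrCreate() -> "FakeSession":
            # Stand-in for SparkSession._getActiveSessionOrCreate().
            return FakeSession()
    
    class FakeDataFrame:
        def __init__(self, session: Optional[FakeSession] = None) -> None:
            # session is None only on the legacy DataFrame(..., SQLContext) path.
            self._session = session
    
        @property
        def sparkSession(self) -> FakeSession:
            if self._session is None:
                self._session = FakeSession._getActiveSessionOrCreate()
            return self._session
    
    # Standard path: the session is attached at construction; no fallback.
    s = FakeSession()
    assert FakeDataFrame(s).sparkSession is s
    # Legacy path: falls back to the active (or a freshly created) session.
    assert isinstance(FakeDataFrame().sparkSession, FakeSession)
    ```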






[GitHub] [spark] HyukjinKwon commented on pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35410:
URL: https://github.com/apache/spark/pull/35410#issuecomment-1031334185


    Will take another look tomorrow to clean up.




[GitHub] [spark] HyukjinKwon commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r800321607



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -104,9 +104,24 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
     .. versionadded:: 1.3.0
     """
 
-    def __init__(self, jdf: JavaObject, sql_ctx: "SQLContext"):

Review comment:
       Here's the main change.






[GitHub] [spark] HyukjinKwon commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r804372377



##########
File path: python/pyspark/sql/session.py
##########
@@ -230,11 +230,6 @@ def enableHiveSupport(self) -> "SparkSession.Builder":
             """
             return self.config("spark.sql.catalogImplementation", "hive")
 
-        def _sparkContext(self, sc: SparkContext) -> "SparkSession.Builder":
-            with self._lock:
-                self._sc = sc
-                return self

Review comment:
       And this function is private.






[GitHub] [spark] HyukjinKwon commented on pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35410:
URL: https://github.com/apache/spark/pull/35410#issuecomment-1037173841


    Will merge this in a few days after double-checking one more time... if we're all fine with it.




[GitHub] [spark] HyukjinKwon commented on pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35410:
URL: https://github.com/apache/spark/pull/35410#issuecomment-1035903593


    Thanks @BryanCutler. I made a few more cleanups here while checking line by line with manual tests.








[GitHub] [spark] HyukjinKwon commented on a change in pull request #35410: [WIP][SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r803206088



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -104,10 +105,25 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
     .. versionadded:: 1.3.0
     """
 
-    def __init__(self, jdf: JavaObject, sql_ctx: "SQLContext"):
-        self._jdf = jdf
-        self.sql_ctx = sql_ctx
-        self._sc: SparkContext = cast(SparkContext, sql_ctx and sql_ctx._sc)
+    def __init__(
+        self,
+        jdf: JavaObject,
+        sql_ctx: Optional["SQLContext"] = None,
+        session: Optional["SparkSession"] = None,
+    ):
+        assert sql_ctx is not None or session is not None
+        self._session = session
+        self._sql_ctx = sql_ctx

Review comment:
       That's a good idea... let me think and update a bit more.






[GitHub] [spark] xinrong-databricks commented on a change in pull request #35410: [WIP][SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r803491903



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -102,12 +104,34 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
           .groupBy(department.name, "gender").agg({"salary": "avg", "age": "max"})
 
     .. versionadded:: 1.3.0
+
+    .. note: A DataFrame should only be created as described above. It should not be directly
+        created via using the constructor.
     """
 
-    def __init__(self, jdf: JavaObject, sql_ctx: "SQLContext"):
-        self._jdf = jdf
-        self.sql_ctx = sql_ctx
-        self._sc: SparkContext = cast(SparkContext, sql_ctx and sql_ctx._sc)
+    def __init__(
+        self,
+        jdf: JavaObject,
+        sql_ctx: Optional[Union["SQLContext", "SparkSession"]],
+    ):
+        from pyspark.sql.session import SparkSession
+
+        self._session: Optional["SparkSession"] = None
+        self._sql_ctx: Optional["SQLContext"] = None
+
+        if isinstance(sql_ctx, SparkSession):
+            self._session = sql_ctx
+        else:
+            assert not os.environ.get("SPARK_TESTING")  # Sanity check for our internal usage.
+            assert isinstance(sql_ctx, SQLContext)
+            # We should remove this if-else branch in the future release, and rename
+            # sql_ctx to session in the constructor. This is an internal code path but
+            # was kept with an warning because but used intensively by third-party libraries.

Review comment:
       nit: because `it's`? used intensively by third-party libraries






[GitHub] [spark] HyukjinKwon commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r810437778



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -116,13 +140,45 @@ def __init__(self, jdf: JavaObject, sql_ctx: "SQLContext"):
         # by __repr__ and _repr_html_ while eager evaluation opened.
         self._support_repr_html = False
 
+    @property
+    def sql_ctx(self) -> "SQLContext":
+        from pyspark.sql.context import SQLContext
+
+        warnings.warn(
+            "DataFrame.sql_ctx is an internal property, and will be removed "
+            "in future releases. Use DataFrame.sparkSession instead."
+        )
+        if self._sql_ctx is None:
+            self._sql_ctx = SQLContext._get_or_create(self._sc)
+        return self._sql_ctx
+
+    @property  # type: ignore[misc]
+    def sparkSession(self) -> "SparkSession":
+        """Returns Spark session that created this :class:`DataFrame`.
+
+        .. versionadded:: 3.3.0
+
+        Examples
+        --------
+        >>> df = spark.range(1)
+        >>> type(df.sparkSession)
+        <class 'pyspark.sql.session.SparkSession'>
+        """
+        from pyspark.sql.session import SparkSession
+
+        if self._session is None:
+            self._session = SparkSession._getActiveSessionOrCreate()

Review comment:
       https://github.com/apache/spark/pull/35575






[GitHub] [spark] zero323 commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
zero323 commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r805165758



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -88,9 +88,10 @@ def toPandas(self) -> "PandasDataFrameLike":
         import pandas as pd
         from pandas.core.dtypes.common import is_timedelta64_dtype
 
-        timezone = self.sql_ctx._conf.sessionLocalTimeZone()  # type: ignore[attr-defined]
+        jconf = self.sparkSession._jconf  # type: ignore[attr-defined]

Review comment:
       This `ignore` shouldn't be necessary.

##########
File path: python/pyspark/sql/session.py
##########
@@ -348,6 +336,11 @@ def _repr_html_(self) -> str:
             sc_HTML=self.sparkContext._repr_html_(),  # type: ignore[attr-defined]
         )
 
+    @property
+    def _jconf(self) -> SparkConf:

Review comment:
       This should be `-> "JavaObject"`, shouldn't it?

##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -88,9 +88,10 @@ def toPandas(self) -> "PandasDataFrameLike":
         import pandas as pd
         from pandas.core.dtypes.common import is_timedelta64_dtype
 
-        timezone = self.sql_ctx._conf.sessionLocalTimeZone()  # type: ignore[attr-defined]
+        jconf = self.sparkSession._jconf  # type: ignore[attr-defined]
+        timezone = jconf.sessionLocalTimeZone()  # type: ignore[attr-defined]

Review comment:
       This one, and the following ones, can also be dropped once the `SparkSession._jconf` return type is fixed.

##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -603,25 +601,25 @@ def _create_from_pandas_with_arrow(
             for pdf_slice in pdf_slices
         ]
 
-        jsqlContext = self._wrapped._jsqlContext  # type: ignore[attr-defined]
+        jsparkSession = self._jsparkSession  # type: ignore[attr-defined]

Review comment:
       We can add a hint 
   
   ```python
   _jsparkSession: "JavaObject"
   ```
   
   at the top of the class for this.
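    
    For reference, a self-contained sketch of why a class-level hint removes the per-use ignores (stub types; in the real code the attribute is a py4j `JavaObject` assigned at runtime, and the mixin class name here is assumed):
    
    ```python
    from typing import Any
    
    class JavaObject:  # stand-in for py4j.java_gateway.JavaObject
        def __getattr__(self, name: str) -> Any: ...
    
    class SparkConversionMixin:
        # Declared once at class level: mypy now knows the attribute exists
        # and what its type is, so call sites need no
        # "type: ignore[attr-defined]" comments.
        _jsparkSession: "JavaObject"
    
        def _create_from_pandas_with_arrow(self) -> None:
            jsparkSession = self._jsparkSession  # type-checks cleanly
    ```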










[GitHub] [spark] HyukjinKwon commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r810432982



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -116,13 +140,45 @@ def __init__(self, jdf: JavaObject, sql_ctx: "SQLContext"):
         # by __repr__ and _repr_html_ while eager evaluation opened.
         self._support_repr_html = False
 
+    @property
+    def sql_ctx(self) -> "SQLContext":
+        from pyspark.sql.context import SQLContext
+
+        warnings.warn(
+            "DataFrame.sql_ctx is an internal property, and will be removed "
+            "in future releases. Use DataFrame.sparkSession instead."
+        )
+        if self._sql_ctx is None:
+            self._sql_ctx = SQLContext._get_or_create(self._sc)
+        return self._sql_ctx
+
+    @property  # type: ignore[misc]
+    def sparkSession(self) -> "SparkSession":
+        """Returns Spark session that created this :class:`DataFrame`.
+
+        .. versionadded:: 3.3.0
+
+        Examples
+        --------
+        >>> df = spark.range(1)
+        >>> type(df.sparkSession)
+        <class 'pyspark.sql.session.SparkSession'>
+        """
+        from pyspark.sql.session import SparkSession
+
+        if self._session is None:
+            self._session = SparkSession._getActiveSessionOrCreate()

Review comment:
       Let me make a quick followup after double-checking it. At least we could change it as you suggested, e.g.:
   
   ```python
   self._session = self._sql_ctx.sparkSession
   ```








[GitHub] [spark] HyukjinKwon commented on pull request #35410: [WIP][SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35410:
URL: https://github.com/apache/spark/pull/35410#issuecomment-1032422008


   Still WIP ... 




[GitHub] [spark] HyukjinKwon commented on pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35410:
URL: https://github.com/apache/spark/pull/35410#issuecomment-1037613214


   Thanks @zero323. I addressed your comments about type hints. I will take another look before merging this in.






[GitHub] [spark] HyukjinKwon commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r806371427



##########
File path: python/pyspark/sql/context.py
##########
@@ -714,14 +711,13 @@ def __init__(self, sparkContext: SparkContext, jhiveContext: Optional[JavaObject
             + "SparkSession.builder.enableHiveSupport().getOrCreate() instead.",
             FutureWarning,
         )
+        static_conf = {}
         if jhiveContext is None:
-            sparkContext._conf.set(  # type: ignore[attr-defined]
-                "spark.sql.catalogImplementation", "hive"
-            )
-            sparkSession = SparkSession.builder._sparkContext(sparkContext).getOrCreate()
-        else:
-            sparkSession = SparkSession(sparkContext, jhiveContext.sparkSession())
-        SQLContext.__init__(self, sparkContext, sparkSession, jhiveContext)
+            static_conf = {"spark.sql.catalogImplementation": "in-memory"}

Review comment:
       https://github.com/apache/spark/pull/35517








[GitHub] [spark] HyukjinKwon commented on a change in pull request #35410: [WIP][SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r802219772



##########
File path: python/pyspark/sql/session.py
##########
@@ -348,6 +344,11 @@ def _repr_html_(self) -> str:
             sc_HTML=self.sparkContext._repr_html_(),  # type: ignore[attr-defined]
         )
 
+    @property
+    def _jconf(self) -> SparkConf:
+        """Accessor for the JVM SQL-specific configurations"""
+        return self._jsparkSession.sessionState().conf()

Review comment:
       I think we should get this from `SQLConf.get` to respect the current `SparkSession`. I will do it in a separate PR.
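    
    A hedged sketch of that alternative (this assumes the JVM-side `org.apache.spark.sql.internal.SQLConf.get` accessor reached through the session's py4j gateway; `active_jconf` is an illustrative helper, not code from this PR):
    
    ```python
    from py4j.java_gateway import JavaObject
    from pyspark.sql import SparkSession
    
    def active_jconf(session: SparkSession) -> JavaObject:
        # SQLConf.get resolves against the currently active session on the
        # JVM side, instead of pinning this session's sessionState().conf().
        return session._jvm.org.apache.spark.sql.internal.SQLConf.get()
    ```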






[GitHub] [spark] zero323 commented on a change in pull request #35410: [WIP][SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
zero323 commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r802573722



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -104,10 +105,25 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
     .. versionadded:: 1.3.0
     """
 
-    def __init__(self, jdf: JavaObject, sql_ctx: "SQLContext"):
-        self._jdf = jdf
-        self.sql_ctx = sql_ctx
-        self._sc: SparkContext = cast(SparkContext, sql_ctx and sql_ctx._sc)
+    def __init__(
+        self,
+        jdf: JavaObject,
+        sql_ctx: Optional["SQLContext"] = None,
+        session: Optional["SparkSession"] = None,
+    ):
+        assert sql_ctx is not None or session is not None
+        self._session = session
+        self._sql_ctx = sql_ctx

Review comment:
       Just thinking out loud. Why not stick to a single argument (`sql_ctx` or `session` ‒ it might require some research, but I'm fairly sure it is usually passed positionally) and
   
   ```python
    self.session = sql_ctx if isinstance(sql_ctx, SparkSession) else sql_ctx.sparkSession
   ```
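    
    A runnable, JVM-free sketch of that single-argument dispatch (the `Stub*` classes are hypothetical stand-ins for the real pyspark ones):
    
    ```python
    from typing import Union
    
    class StubSession:
        pass
    
    class StubSQLContext:
        def __init__(self, session: StubSession) -> None:
            self.sparkSession = session
    
    class StubDataFrame:
        def __init__(self, sql_ctx: Union[StubSQLContext, StubSession]) -> None:
            # One positional argument, normalized to a session either way.
            self.session = (
                sql_ctx if isinstance(sql_ctx, StubSession) else sql_ctx.sparkSession
            )
    
    session = StubSession()
    assert StubDataFrame(session).session is session
    assert StubDataFrame(StubSQLContext(session)).session is session
    ```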
   
   






[GitHub] [spark] HyukjinKwon commented on a change in pull request #35410: [WIP][SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r802523023



##########
File path: python/pyspark/sql/session.py
##########
@@ -230,11 +230,6 @@ def enableHiveSupport(self) -> "SparkSession.Builder":
             """
             return self.config("spark.sql.catalogImplementation", "hive")
 
-        def _sparkContext(self, sc: SparkContext) -> "SparkSession.Builder":
-            with self._lock:
-                self._sc = sc
-                return self

Review comment:
       We only have one SparkContext running globally now, so it no longer makes sense to allow setting the SparkContext on the builder.
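    
    For example, the invariant this relies on is observable through the public API (requires a local PySpark install):
    
    ```python
    from pyspark import SparkContext
    
    # PySpark keeps a single global SparkContext per Python process, so
    # getOrCreate() always hands back the same instance.
    sc1 = SparkContext.getOrCreate()
    sc2 = SparkContext.getOrCreate()
    assert sc1 is sc2
    ```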






[GitHub] [spark] HyukjinKwon commented on pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35410:
URL: https://github.com/apache/spark/pull/35410#issuecomment-1039724931


   Thanks all for reviewing this.




[GitHub] [spark] HyukjinKwon commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r806358814



##########
File path: python/pyspark/sql/context.py
##########
@@ -714,14 +711,13 @@ def __init__(self, sparkContext: SparkContext, jhiveContext: Optional[JavaObject
             + "SparkSession.builder.enableHiveSupport().getOrCreate() instead.",
             FutureWarning,
         )
+        static_conf = {}
         if jhiveContext is None:
-            sparkContext._conf.set(  # type: ignore[attr-defined]
-                "spark.sql.catalogImplementation", "hive"
-            )
-            sparkSession = SparkSession.builder._sparkContext(sparkContext).getOrCreate()
-        else:
-            sparkSession = SparkSession(sparkContext, jhiveContext.sparkSession())
-        SQLContext.__init__(self, sparkContext, sparkSession, jhiveContext)
+            static_conf = {"spark.sql.catalogImplementation": "in-memory"}

Review comment:
       Shoot. I pushed a wrong change in the last minute - I was manually testing if this value is correctly set or not. I has to be `hive`. This uses an existing session that has `hive` by default so it doesn't affect the change but would impact shell if users use `HiveContext` directly.
   
   Sorry this was my mistake. I will make a quick followup.







[GitHub] [spark] HyukjinKwon commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r804378522



##########
File path: python/pyspark/sql/context.py
##########
@@ -164,17 +158,20 @@ def getOrCreate(cls: Type["SQLContext"], sc: SparkContext) -> "SQLContext":
             "Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.",
             FutureWarning,
         )
+        return cls._get_or_create(sc)
+
+    @classmethod
+    def _get_or_create(cls: Type["SQLContext"], sc: SparkContext) -> "SQLContext":

Review comment:
       This is actually called as `_get_or_create` in `shell.py` ... to avoid an unnecessary deprecation warning in the PySpark shell.
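
   A rough sketch of the intended split, with the `shell.py` call site paraphrased (the exact surrounding code is not quoted in this thread):

       # public path (deprecated): warns, then delegates internally
       SQLContext.getOrCreate(sc)            # emits FutureWarning

       # internal path used by pyspark/shell.py: same result, no warning
       sqlContext = SQLContext._get_or_create(sc)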






[GitHub] [spark] HyukjinKwon commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r804365905



##########
File path: python/pyspark/sql/context.py
##########
@@ -718,10 +715,10 @@ def __init__(self, sparkContext: SparkContext, jhiveContext: Optional[JavaObject
             sparkContext._conf.set(  # type: ignore[attr-defined]
                 "spark.sql.catalogImplementation", "hive"
             )
-            sparkSession = SparkSession.builder._sparkContext(sparkContext).getOrCreate()
-        else:
-            sparkSession = SparkSession(sparkContext, jhiveContext.sparkSession())

Review comment:
       `spark.sql.catalogImplementation` is a static configuration that can only be set for the root parent Spark session. Therefore, we can just use any active child Spark session. It doesn't have to be the Spark session in `HiveContext`.
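
   For illustration, static SQL configurations behave roughly like this (a generic sketch, not code from this PR):

       from pyspark.sql import SparkSession

       # Static confs must be set before the root session is created ...
       spark = (
           SparkSession.builder
           .config("spark.sql.catalogImplementation", "hive")
           .getOrCreate()
       )

       # ... and cannot be modified afterwards; this raises an error
       # (AnalysisException: Cannot modify the value of a static config):
       spark.conf.set("spark.sql.catalogImplementation", "in-memory")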






[GitHub] [spark] HyukjinKwon removed a comment on pull request #35410: [WIP][SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon removed a comment on pull request #35410:
URL: https://github.com/apache/spark/pull/35410#issuecomment-1031334185


   will take another look tmr to clean up




[GitHub] [spark] HyukjinKwon commented on a change in pull request #35410: [WIP][SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r802524328



##########
File path: python/pyspark/sql/tests/test_session.py
##########
@@ -224,28 +224,26 @@ def tearDown(self):
     def test_sqlcontext_with_stopped_sparksession(self):
         # SPARK-30856: test that SQLContext.getOrCreate() returns a usable instance after
         # the SparkSession is restarted.
-        sql_context = self.spark._wrapped
+        sql_context = SQLContext.getOrCreate(self.spark.sparkContext)
         self.spark.stop()
-        sc = SparkContext("local[4]", self.sc.appName)
-        spark = SparkSession(sc)  # Instantiate the underlying SQLContext
-        new_sql_context = spark._wrapped
+        spark = SparkSession.builder.master("local[4]").appName(self.sc.appName).getOrCreate()
+        new_sql_context = SQLContext.getOrCreate(spark.sparkContext)
 
         self.assertIsNot(new_sql_context, sql_context)
-        self.assertIs(SQLContext.getOrCreate(sc).sparkSession, spark)
+        self.assertIs(SQLContext.getOrCreate(spark.sparkContext).sparkSession, spark)
         try:
             df = spark.createDataFrame([(1, 2)], ["c", "c"])
             df.collect()
         finally:
             spark.stop()
             self.assertIsNone(SQLContext._instantiatedContext)
-            sc.stop()

Review comment:
       `spark.stop()` calls `sc.stop()`
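
   This is easy to check interactively (a sketch; it relies on the fact that a stopped `SparkContext` can no longer create RDDs):

       from pyspark.sql import SparkSession

       spark = SparkSession.builder.master("local[1]").getOrCreate()
       sc = spark.sparkContext
       spark.stop()
       try:
           sc.parallelize([1]).count()
       except Exception:
           print("the underlying SparkContext was stopped as well")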






[GitHub] [spark] HyukjinKwon commented on pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35410:
URL: https://github.com/apache/spark/pull/35410#issuecomment-1035730260


   I will do some manual tests to make sure, but I believe this is ready for review. Adding @viirya @ueshin @BryanCutler in case you guys find some time to review.




[GitHub] [spark] viirya commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r804375073



##########
File path: python/pyspark/sql/context.py
##########
@@ -164,17 +158,20 @@ def getOrCreate(cls: Type["SQLContext"], sc: SparkContext) -> "SQLContext":
             "Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.",
             FutureWarning,
         )
+        return cls._get_or_create(sc)
+
+    @classmethod
+    def _get_or_create(cls: Type["SQLContext"], sc: SparkContext) -> "SQLContext":

Review comment:
       Hmm, the above `getOrCreate` has been deprecated for a while - why add a new `_get_or_create`?








[GitHub] [spark] zero323 commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
zero323 commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r805166064



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -88,9 +88,10 @@ def toPandas(self) -> "PandasDataFrameLike":
         import pandas as pd
         from pandas.core.dtypes.common import is_timedelta64_dtype
 
-        timezone = self.sql_ctx._conf.sessionLocalTimeZone()  # type: ignore[attr-defined]
+        jconf = self.sparkSession._jconf  # type: ignore[attr-defined]
+        timezone = jconf.sessionLocalTimeZone()  # type: ignore[attr-defined]

Review comment:
       This one, and the following one as well, once the `SparkSession._jconf` return type is fixed.






[GitHub] [spark] zero323 commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
zero323 commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r805165758



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -88,9 +88,10 @@ def toPandas(self) -> "PandasDataFrameLike":
         import pandas as pd
         from pandas.core.dtypes.common import is_timedelta64_dtype
 
-        timezone = self.sql_ctx._conf.sessionLocalTimeZone()  # type: ignore[attr-defined]
+        jconf = self.sparkSession._jconf  # type: ignore[attr-defined]

Review comment:
       This `ignore` shouldn't be necessary.
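
   Presumably, once `SparkSession._jconf` is properly annotated, both call sites can drop their annotations entirely:

       jconf = self.sparkSession._jconf
       timezone = jconf.sessionLocalTimeZone()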








[GitHub] [spark] zsxwing commented on a change in pull request #35410: [SPARK-38121][PYTHON][SQL] Use SparkSession instead of SQLContext inside PySpark

Posted by GitBox <gi...@apache.org>.
zsxwing commented on a change in pull request #35410:
URL: https://github.com/apache/spark/pull/35410#discussion_r810321103



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -116,13 +140,45 @@ def __init__(self, jdf: JavaObject, sql_ctx: "SQLContext"):
         # by __repr__ and _repr_html_ while eager evaluation opened.
         self._support_repr_html = False
 
+    @property
+    def sql_ctx(self) -> "SQLContext":
+        from pyspark.sql.context import SQLContext
+
+        warnings.warn(
+            "DataFrame.sql_ctx is an internal property, and will be removed "
+            "in future releases. Use DataFrame.sparkSession instead."
+        )
+        if self._sql_ctx is None:
+            self._sql_ctx = SQLContext._get_or_create(self._sc)
+        return self._sql_ctx
+
+    @property  # type: ignore[misc]
+    def sparkSession(self) -> "SparkSession":
+        """Returns Spark session that created this :class:`DataFrame`.
+
+        .. versionadded:: 3.3.0
+
+        Examples
+        --------
+        >>> df = spark.range(1)
+        >>> type(df.sparkSession)
+        <class 'pyspark.sql.session.SparkSession'>
+        """
+        from pyspark.sql.session import SparkSession
+
+        if self._session is None:
+            self._session = SparkSession._getActiveSessionOrCreate()

Review comment:
       This does not look correct: the current DataFrame's `sparkSession` may not be the active session. Why not get the session from `self._sql_ctx` and assign it to `self._session` in the constructor?
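
   A sketch of the constructor pattern being suggested (names follow the diff above; this is not the merged code):

       def __init__(self, jdf, sql_ctx):
           ...
           # Capture the owning session eagerly instead of resolving the
           # "currently active" session later in the sparkSession property.
           if sql_ctx is not None:
               self._session = sql_ctx.sparkSession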



