
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/12/14 07:22:37 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

HyukjinKwon opened a new pull request #34893:
URL: https://github.com/apache/spark/pull/34893


   ### What changes were proposed in this pull request?
   
   This PR proposes to use an existing active Spark session instead of `SparkSession.getOrCreate` in pandas API on Spark.
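
The lookup order described above can be sketched roughly as follows. This is only an illustration, not the code in `python/pyspark/pandas/utils.py`: `resolve_session` is a hypothetical helper introduced here so the logic can be exercised without a running Spark, and the fallback of building a session named "pandas-on-Spark" is an assumption based on the diff in this thread.

```python
from typing import Any, Callable, Optional


def resolve_session(
    get_active: Callable[[], Optional[Any]],
    create: Callable[[], Any],
) -> Any:
    """Return the active session if there is one; otherwise create a new one."""
    spark = get_active()
    if spark is not None:
        # Reuse the existing session as-is: no builder is involved, so no
        # "configurations will not take effect" warnings are logged.
        return spark
    return create()


def default_session() -> Any:
    # Imported lazily so that this sketch can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    return resolve_session(
        SparkSession.getActiveSession,
        # Assumption: the fallback still builds a session named "pandas-on-Spark".
        lambda: SparkSession.builder.appName("pandas-on-Spark").getOrCreate(),
    )
```

With PySpark installed, calling `default_session()` in a shell that already has a session should return that same session rather than going through the builder.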
   
   ### Why are the changes needed?
   
   Without this change, pandas API on Spark attempts to create a new session through `SparkSession.getOrCreate` even when one is already active, and shows warnings that the configurations will not take effect, as below:
   
   ```python
   >>> ps.range(10)
   21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the static sql configurations will not take effect.
   21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some spark core configurations may not take effect.
   21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the static sql configurations will not take effect.
   21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some spark core configurations may not take effect.
   21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the static sql configurations will not take effect.
   21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some spark core configurations may not take effect.
   21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the static sql configurations will not take effect.
   21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some spark core configurations may not take effect.
   ...
      id
   0   0
   1   1
   2   2
   3   3
   4   4
   5   5
   6   6
   7   7
   8   8
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, after this PR, pandas API on Spark explicitly uses the active Spark session and no longer shows such warnings:
   
   ```python
   >>> import pyspark.pandas as ps
   >>> ps.range(10)
   ...
      id
   0   0
   1   1
   2   2
   3   3
   ```
   
   ### How was this patch tested?
   
   Manually tested as below:
   
   ```python
   import pyspark.pandas as ps
   ps.range(10)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon closed pull request #34893:
URL: https://github.com/apache/spark/pull/34893


   



[GitHub] [spark] SparkQA commented on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993479620


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50653/
   



[GitHub] [spark] ueshin commented on a change in pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

ueshin commented on a change in pull request #34893:
URL: https://github.com/apache/spark/pull/34893#discussion_r769196943



##########
File path: python/pyspark/pandas/utils.py
##########
@@ -464,20 +464,12 @@ def is_testing() -> bool:
     return "SPARK_TESTING" in os.environ
 
 
-def default_session(conf: Optional[Dict[str, Any]] = None) -> SparkSession:
-    if conf is None:
-        conf = dict()
+def default_session() -> SparkSession:
+    spark = SparkSession.getActiveSession()
+    if spark is not None:
+        return spark
 
     builder = SparkSession.builder.appName("pandas-on-Spark")
-    for key, value in conf.items():
-        builder = builder.config(key, value)
-    # Currently, pandas-on-Spark is dependent on such join due to 'compute.ops_on_diff_frames'
-    # configuration. This is needed with Spark 3.0+.
-    builder.config("spark.sql.analyzer.failAmbiguousSelfJoin", False)
-
-    if is_testing():
-        builder.config("spark.executor.allowSparkContext", False)

Review comment:
       Actually this was added for our test to check our code doesn't create `SparkContext` in executors.
   But we can remove it anyway because the default value is `False` now.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993265363


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146171/
   



[GitHub] [spark] HyukjinKwon commented on a change in pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34893:
URL: https://github.com/apache/spark/pull/34893#discussion_r768500829



##########
File path: python/pyspark/pandas/utils.py
##########
@@ -464,20 +464,12 @@ def is_testing() -> bool:
     return "SPARK_TESTING" in os.environ
 
 
-def default_session(conf: Optional[Dict[str, Any]] = None) -> SparkSession:
-    if conf is None:
-        conf = dict()
+def default_session() -> SparkSession:
+    spark = SparkSession.getActiveSession()
+    if spark is not None:
+        return spark
 
     builder = SparkSession.builder.appName("pandas-on-Spark")
-    for key, value in conf.items():
-        builder = builder.config(key, value)
-    # Currently, pandas-on-Spark is dependent on such join due to 'compute.ops_on_diff_frames'
-    # configuration. This is needed with Spark 3.0+.
-    builder.config("spark.sql.analyzer.failAmbiguousSelfJoin", False)

Review comment:
       In fact, we fixed this bug in the master branch, so we don't need to set this anymore.





[GitHub] [spark] SparkQA commented on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993232643


   **[Test build #146171 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146171/testReport)** for PR 34893 at commit [`39acf99`](https://github.com/apache/spark/commit/39acf9901644027aa1cbc0a161afdc44b1f9eae6).



[GitHub] [spark] HyukjinKwon commented on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993232098


   cc @xinrong-databricks and @ueshin FYI



[GitHub] [spark] SparkQA commented on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993387873


   **[Test build #146180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146180/testReport)** for PR 34893 at commit [`ce0cd8b`](https://github.com/apache/spark/commit/ce0cd8be5340b6b8804c50ffbe2ec92036ece341).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993439239


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146180/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993439239


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146180/
   



[GitHub] [spark] SparkQA commented on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993433189


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50653/
   



[GitHub] [spark] HyukjinKwon commented on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-994238174


   Merged to master.



[GitHub] [spark] SparkQA commented on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993438590


   **[Test build #146180 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146180/testReport)** for PR 34893 at commit [`ce0cd8b`](https://github.com/apache/spark/commit/ce0cd8be5340b6b8804c50ffbe2ec92036ece341).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993482157


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50653/
   



[GitHub] [spark] SparkQA commented on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993318339


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50644/
   



[GitHub] [spark] ueshin commented on a change in pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

ueshin commented on a change in pull request #34893:
URL: https://github.com/apache/spark/pull/34893#discussion_r768958909



##########
File path: python/pyspark/pandas/utils.py
##########
@@ -464,20 +464,12 @@ def is_testing() -> bool:
     return "SPARK_TESTING" in os.environ
 
 
-def default_session(conf: Optional[Dict[str, Any]] = None) -> SparkSession:
-    if conf is None:
-        conf = dict()
+def default_session() -> SparkSession:
+    spark = SparkSession.getActiveSession()
+    if spark is not None:
+        return spark
 
     builder = SparkSession.builder.appName("pandas-on-Spark")
-    for key, value in conf.items():
-        builder = builder.config(key, value)
-    # Currently, pandas-on-Spark is dependent on such join due to 'compute.ops_on_diff_frames'
-    # configuration. This is needed with Spark 3.0+.
-    builder.config("spark.sql.analyzer.failAmbiguousSelfJoin", False)
-
-    if is_testing():
-        builder.config("spark.executor.allowSparkContext", False)

Review comment:
       Where is this set?
   This is for our testing to check our internal code doesn't create `SparkSession` in executors.





[GitHub] [spark] HyukjinKwon commented on a change in pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34893:
URL: https://github.com/apache/spark/pull/34893#discussion_r769145116



##########
File path: python/pyspark/pandas/utils.py
##########
@@ -464,20 +464,12 @@ def is_testing() -> bool:
     return "SPARK_TESTING" in os.environ
 
 
-def default_session(conf: Optional[Dict[str, Any]] = None) -> SparkSession:
-    if conf is None:
-        conf = dict()
+def default_session() -> SparkSession:
+    spark = SparkSession.getActiveSession()
+    if spark is not None:
+        return spark
 
     builder = SparkSession.builder.appName("pandas-on-Spark")
-    for key, value in conf.items():
-        builder = builder.config(key, value)
-    # Currently, pandas-on-Spark is dependent on such join due to 'compute.ops_on_diff_frames'
-    # configuration. This is needed with Spark 3.0+.
-    builder.config("spark.sql.analyzer.failAmbiguousSelfJoin", False)
-
-    if is_testing():
-        builder.config("spark.executor.allowSparkContext", False)

Review comment:
       In fact, this was set when we ran tests with `pytest` back when the code was in the Koalas repo.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993482157


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50653/
   



[GitHub] [spark] ueshin commented on a change in pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

ueshin commented on a change in pull request #34893:
URL: https://github.com/apache/spark/pull/34893#discussion_r769145327



##########
File path: python/pyspark/pandas/utils.py
##########
@@ -464,20 +464,12 @@ def is_testing() -> bool:
     return "SPARK_TESTING" in os.environ
 
 
-def default_session(conf: Optional[Dict[str, Any]] = None) -> SparkSession:
-    if conf is None:
-        conf = dict()
+def default_session() -> SparkSession:

Review comment:
       I'm not sure we can remove the `conf` argument here?
   I guess we should show a deprecation warning if it's not `None` for now and remove it in the future?
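
The deprecation path suggested in this comment could look roughly like the sketch below. It illustrates the suggestion only and is not necessarily what was merged; `warn_if_conf_passed` is a hypothetical helper, and the warning text and the choice of `FutureWarning` are assumptions.

```python
import warnings
from typing import Any, Dict, Optional


def warn_if_conf_passed(conf: Optional[Dict[str, Any]]) -> None:
    """Warn when the legacy ``conf`` argument is still being passed."""
    if conf is not None:
        warnings.warn(
            "The 'conf' argument of default_session is deprecated and ignored; "
            "configure the Spark session before using pandas API on Spark.",
            FutureWarning,
            # Attribute the warning to the caller of default_session.
            stacklevel=3,
        )


def default_session(conf: Optional[Dict[str, Any]] = None) -> Any:
    # Keep accepting ``conf`` for now, but only to warn about it.
    warn_if_conf_passed(conf)

    # Imported lazily so that this sketch can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.getActiveSession()
    if spark is not None:
        return spark
    return SparkSession.builder.appName("pandas-on-Spark").getOrCreate()
```

Keeping the parameter for a release or two would let existing callers keep working while signalling that the values are no longer applied.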





[GitHub] [spark] HyukjinKwon commented on a change in pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34893:
URL: https://github.com/apache/spark/pull/34893#discussion_r768503399



##########
File path: python/pyspark/pandas/utils.py
##########
@@ -464,20 +464,12 @@ def is_testing() -> bool:
     return "SPARK_TESTING" in os.environ
 
 
-def default_session(conf: Optional[Dict[str, Any]] = None) -> SparkSession:
-    if conf is None:
-        conf = dict()
+def default_session() -> SparkSession:
+    spark = SparkSession.getActiveSession()
+    if spark is not None:
+        return spark
 
     builder = SparkSession.builder.appName("pandas-on-Spark")
-    for key, value in conf.items():
-        builder = builder.config(key, value)
-    # Currently, pandas-on-Spark is dependent on such join due to 'compute.ops_on_diff_frames'
-    # configuration. This is needed with Spark 3.0+.
-    builder.config("spark.sql.analyzer.failAmbiguousSelfJoin", False)
-
-    if is_testing():
-        builder.config("spark.executor.allowSparkContext", False)

Review comment:
       and this will be set separately when SparkContext is created, not here.





[GitHub] [spark] HyukjinKwon commented on a change in pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34893:
URL: https://github.com/apache/spark/pull/34893#discussion_r769154522



##########
File path: python/pyspark/pandas/utils.py
##########
@@ -464,20 +464,12 @@ def is_testing() -> bool:
     return "SPARK_TESTING" in os.environ
 
 
-def default_session(conf: Optional[Dict[str, Any]] = None) -> SparkSession:
-    if conf is None:
-        conf = dict()
+def default_session() -> SparkSession:

Review comment:
       I think this isn't a documented API, so it should be fine.





[GitHub] [spark] ueshin commented on a change in pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

ueshin commented on a change in pull request #34893:
URL: https://github.com/apache/spark/pull/34893#discussion_r769197301



##########
File path: python/pyspark/pandas/utils.py
##########
@@ -464,20 +464,12 @@ def is_testing() -> bool:
     return "SPARK_TESTING" in os.environ
 
 
-def default_session(conf: Optional[Dict[str, Any]] = None) -> SparkSession:
-    if conf is None:
-        conf = dict()
+def default_session() -> SparkSession:

Review comment:
       I'd just leave it to you.





[GitHub] [spark] AmplabJenkins commented on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993265363


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146171/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993332602


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50644/
   



[GitHub] [spark] SparkQA commented on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993249635


   **[Test build #146171 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146171/testReport)** for PR 34893 at commit [`39acf99`](https://github.com/apache/spark/commit/39acf9901644027aa1cbc0a161afdc44b1f9eae6).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] SparkQA removed a comment on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993232643


   **[Test build #146171 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146171/testReport)** for PR 34893 at commit [`39acf99`](https://github.com/apache/spark/commit/39acf9901644027aa1cbc0a161afdc44b1f9eae6).


[GitHub] [spark] SparkQA commented on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993264325


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50644/
   


[GitHub] [spark] AmplabJenkins commented on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993332602


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50644/
   


[GitHub] [spark] SparkQA removed a comment on pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34893:
URL: https://github.com/apache/spark/pull/34893#issuecomment-993387873


   **[Test build #146180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146180/testReport)** for PR 34893 at commit [`ce0cd8b`](https://github.com/apache/spark/commit/ce0cd8be5340b6b8804c50ffbe2ec92036ece341).


[GitHub] [spark] ueshin commented on a change in pull request #34893: [SPARK-37638][PYTHON] Use existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark

Posted by GitBox <gi...@apache.org>.

ueshin commented on a change in pull request #34893:
URL: https://github.com/apache/spark/pull/34893#discussion_r768958909



##########
File path: python/pyspark/pandas/utils.py
##########
@@ -464,20 +464,12 @@ def is_testing() -> bool:
     return "SPARK_TESTING" in os.environ
 
 
-def default_session(conf: Optional[Dict[str, Any]] = None) -> SparkSession:
-    if conf is None:
-        conf = dict()
+def default_session() -> SparkSession:
+    spark = SparkSession.getActiveSession()
+    if spark is not None:
+        return spark
 
     builder = SparkSession.builder.appName("pandas-on-Spark")
-    for key, value in conf.items():
-        builder = builder.config(key, value)
-    # Currently, pandas-on-Spark is dependent on such join due to 'compute.ops_on_diff_frames'
-    # configuration. This is needed with Spark 3.0+.
-    builder.config("spark.sql.analyzer.failAmbiguousSelfJoin", False)
-
-    if is_testing():
-        builder.config("spark.executor.allowSparkContext", False)

Review comment:
       Where is this set now?
   It is there so that our tests can check that our internal code doesn't create a `SparkSession` in executors.
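
To make the question concrete, here is a minimal sketch (not the PR's actual code) of how the two removed settings could be kept for the fallback path, where no active session exists and a new one has to be built. The helper names `fallback_session_conf` and `apply_conf` are hypothetical; the config keys and the `SPARK_TESTING` check are taken from the diff above. The sketch deliberately avoids importing PySpark, so `builder` is assumed to be any object with a chainable `config(key, value)` method, as `SparkSession.builder` has.

```python
import os
from typing import Any, Dict, Mapping, Optional


def fallback_session_conf(environ: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Hypothetical helper: configs to apply only when a new session is built."""
    if environ is None:
        environ = os.environ
    conf: Dict[str, Any] = {
        # From the removed lines: pandas-on-Spark depends on such joins
        # because of the 'compute.ops_on_diff_frames' configuration.
        "spark.sql.analyzer.failAmbiguousSelfJoin": False,
    }
    if "SPARK_TESTING" in environ:
        # Test-only guard: internal code must not create a SparkSession
        # in executors.
        conf["spark.executor.allowSparkContext"] = False
    return conf


def apply_conf(builder: Any, conf: Mapping[str, Any]) -> Any:
    """Chain builder.config(key, value) for every entry and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

Under these assumptions, `default_session()` could still return the active session first and call `apply_conf(SparkSession.builder.appName("pandas-on-Spark"), fallback_session_conf())` only when `SparkSession.getActiveSession()` returns `None`. Whether the test guard belongs there or in the test setup is the open question in this comment.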



