Posted to reviews@spark.apache.org by "xinrong-meng (via GitHub)" <gi...@apache.org> on 2024/02/27 00:56:14 UTC

[PR] [WIP] Documentation for SparkSession-based Profilers [spark]

xinrong-meng opened a new pull request, #45269:
URL: https://github.com/apache/spark/pull/45269

   ### What changes were proposed in this pull request?
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [WIP][SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #45269:
URL: https://github.com/apache/spark/pull/45269#discussion_r1513425698


##########
python/docs/source/reference/pyspark.sql/spark_session.rst:
##########
@@ -49,6 +49,7 @@ See also :class:`SparkSession`.
     SparkSession.createDataFrame
     SparkSession.getActiveSession
     SparkSession.newSession
+    SparkSession.profile

Review Comment:
   I think SparkSession.builder works because it is a classproperty whereas profile is a property of SparkSession.
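To illustrate the distinction drawn above, here is a minimal sketch in plain Python (no PySpark needed). The `classproperty` descriptor, `Session`, `Builder`, and `Profile` below are hypothetical stand-ins, not PySpark's actual implementation; they only show why a class-level lookup behaves differently for the two kinds of attribute.

```python
class classproperty:
    """Hypothetical descriptor: like property, but also resolves on the class."""

    def __init__(self, fget):
        self.fget = fget

    def __get__(self, instance, owner):
        # Called for both Session.builder and Session().builder;
        # in either case the getter receives the class.
        return self.fget(owner)


class Builder:
    pass


class Profile:
    def dump(self, path):
        return f"dumped to {path}"


class Session:
    @classproperty
    def builder(cls):
        return Builder()

    @property
    def profile(self):
        return Profile()


# Class-level access: the classproperty yields a real Builder ...
builder_on_class = Session.builder
# ... while the plain property yields the property object itself.
profile_on_class = Session.profile
# Instance-level access is what actually runs the property getter.
profile_on_instance = Session().profile

print(type(builder_on_class).__name__)
print(type(profile_on_class).__name__)
print(type(profile_on_instance).__name__)
```

A documentation tool that inspects the class rather than an instance would therefore see a `Builder` for the first attribute but only a `property` object for the second.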





Re: [PR] [WIP][SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #45269:
URL: https://github.com/apache/spark/pull/45269#discussion_r1513424368


##########
python/docs/source/reference/pyspark.sql/spark_session.rst:
##########
@@ -49,6 +49,7 @@ See also :class:`SparkSession`.
     SparkSession.createDataFrame
     SparkSession.getActiveSession
     SparkSession.newSession
+    SparkSession.profile

Review Comment:
   Hmm, I was thinking the same, but it kept failing with the error message.





Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45269:
URL: https://github.com/apache/spark/pull/45269#discussion_r1506903959


##########
python/docs/source/reference/pyspark.sql/spark_session.rst:
##########
@@ -49,6 +49,7 @@ See also :class:`SparkSession`.
     SparkSession.createDataFrame
     SparkSession.getActiveSession
     SparkSession.newSession
+    SparkSession.profile

Review Comment:
   I think we should also have a dedicated section for `profile.show` and `profile.dump`.
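As a point of reference for what such a section would cover, the sketch below uses only the standard library's `cProfile` and `pstats` (the Python profilers the debugging guide links to) to print collected stats and to write them to a file. It does not call PySpark; `slow_sum` and the output file name are hypothetical, and the mapping to `profile.show` and `profile.dump` is an analogy, not a description of their implementation.

```python
import cProfile
import io
import os
import pstats
import tempfile


def slow_sum(n):
    """Hypothetical workload standing in for a UDF body."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# "Show": render the collected stats as text, sorted by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)

# "Dump": persist the raw stats so they can be loaded again later.
dump_path = os.path.join(tempfile.mkdtemp(), "udf_profile.pstats")
profiler.dump_stats(dump_path)
reloaded = pstats.Stats(dump_path)
```

The dumped file can be reloaded with `pstats.Stats(path)` in a later session, which is the usual reason to offer a dump operation alongside an on-screen report.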





Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #45269:
URL: https://github.com/apache/spark/pull/45269#discussion_r1511705327


##########
python/docs/source/reference/pyspark.sql/spark_session.rst:
##########
@@ -49,6 +49,7 @@ See also :class:`SparkSession`.
     SparkSession.createDataFrame
     SparkSession.getActiveSession
     SparkSession.newSession
+    SparkSession.profile

Review Comment:
   Sounds good. Updated here https://github.com/apache/spark/pull/45269/files#diff-1d5123b540315e1c678a3c7f5af287076c8296f71230592990c344933d02f664R90.





Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #45269:
URL: https://github.com/apache/spark/pull/45269#discussion_r1511707998


##########
python/docs/source/development/debugging.rst:
##########
@@ -341,7 +372,12 @@ Python/Pandas UDF
 ~~~~~~~~~~~~~~~~~
 
 To use this on Python/Pandas UDFs, PySpark provides remote `Python Profilers <https://docs.python.org/3/library/profile.html>`_ for
-Python/Pandas UDFs, which can be enabled by setting ``spark.python.profile`` configuration to ``true``.
+Python/Pandas UDFs.
+
+SparkContext-based

Review Comment:
   We will remove the "legacy" profilers for readability and clarity, and start preparing a migration guide.





Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #45269:
URL: https://github.com/apache/spark/pull/45269#discussion_r1511898364


##########
python/docs/source/reference/pyspark.sql/spark_session.rst:
##########
@@ -49,6 +49,7 @@ See also :class:`SparkSession`.
     SparkSession.createDataFrame
     SparkSession.getActiveSession
     SparkSession.newSession
+    SparkSession.profile

Review Comment:
   Do we need the following?
   
   ```
   :template: autosummary/accessor_method.rst
   ```
   
   See https://github.com/apache/spark/pull/44012#discussion_r1405231062





Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #45269:
URL: https://github.com/apache/spark/pull/45269#discussion_r1508113784


##########
python/docs/source/development/debugging.rst:
##########
@@ -341,7 +372,12 @@ Python/Pandas UDF
 ~~~~~~~~~~~~~~~~~
 
 To use this on Python/Pandas UDFs, PySpark provides remote `Python Profilers <https://docs.python.org/3/library/profile.html>`_ for
-Python/Pandas UDFs, which can be enabled by setting ``spark.python.profile`` configuration to ``true``.
+Python/Pandas UDFs.
+
+SparkContext-based

Review Comment:
   I believe there are many existing users of the SparkContext-based profilers. Shall we keep them in the debugging guide until the SparkSession-based profilers gain more adoption and positive feedback? I'll adjust the order to show the SparkSession-based profilers first, as @ueshin suggested. What do you think, @HyukjinKwon?





Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #45269:
URL: https://github.com/apache/spark/pull/45269#discussion_r1511840961


##########
python/docs/source/reference/pyspark.sql/spark_session.rst:
##########
@@ -49,6 +49,7 @@ See also :class:`SparkSession`.
     SparkSession.createDataFrame
     SparkSession.getActiveSession
     SparkSession.newSession
+    SparkSession.profile

Review Comment:
   I hit
   ```
   [autosummary] failed to import pyspark.sql.SparkSession.profile.dump.
   Possible hints:
   * AttributeError: 'property' object has no attribute 'dump'
   * ImportError: 
   * ModuleNotFoundError: No module named 'pyspark.sql.SparkSession'
   ```
   The `profile` property returns a `Profile` class instance, so Sphinx might have difficulty accessing it. Do you happen to know the best way to resolve that?
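The first hint in that log can be reproduced without Sphinx. The sketch below assumes autosummary resolves a dotted name roughly by importing a module prefix and then walking the remaining parts with `getattr`; that is a simplification, and `resolve_attrs`, `Session`, and `Profile` are hypothetical names used only for illustration.

```python
from functools import reduce


class Profile:
    def dump(self, path):
        return f"dumped to {path}"


class Session:
    @property
    def profile(self):
        return Profile()


def resolve_attrs(root, dotted):
    """Walk a dotted attribute path starting from root using getattr."""
    return reduce(getattr, dotted.split("."), root)


# Walking from the class stops at the property object, which has no 'dump'.
try:
    resolve_attrs(Session, "profile.dump")
    class_level_error = None
except AttributeError as exc:
    class_level_error = str(exc)

# Walking from an instance runs the getter first, so 'dump' is found.
bound_dump = resolve_attrs(Session(), "profile.dump")

print(class_level_error)
print(bound_dump("/tmp/out"))
```

Under that assumption, the failure comes from looking the name up on the class: the getter never runs, so there is no `Profile` instance whose `dump` could be found.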





Re: [PR] [WIP][SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #45269:
URL: https://github.com/apache/spark/pull/45269#discussion_r1513550672


##########
python/docs/source/reference/pyspark.sql/spark_session.rst:
##########
@@ -49,6 +49,7 @@ See also :class:`SparkSession`.
     SparkSession.createDataFrame
     SparkSession.getActiveSession
     SparkSession.newSession
+    SparkSession.profile

Review Comment:
   I have a workaround [76e7387](https://github.com/apache/spark/pull/45269/commits/76e738768d591ff59e9f14210bbccecc9458896a) by using autoclass, but it doesn't look consistent with the rest of the page, as shown below.
   
   ![image](https://github.com/apache/spark/assets/47337188/ab2e8f7b-2d65-4788-af47-354bd66a6fa2)
   
   I'm wondering if we should have a follow-up designated for that part.
   






Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45269:
URL: https://github.com/apache/spark/pull/45269#discussion_r1506906556


##########
python/docs/source/development/debugging.rst:
##########
@@ -341,7 +372,12 @@ Python/Pandas UDF
 ~~~~~~~~~~~~~~~~~
 
 To use this on Python/Pandas UDFs, PySpark provides remote `Python Profilers <https://docs.python.org/3/library/profile.html>`_ for
-Python/Pandas UDFs, which can be enabled by setting ``spark.python.profile`` configuration to ``true``.
+Python/Pandas UDFs.
+
+SparkContext-based

Review Comment:
   cc @ueshin do you have other thoughts?





Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #45269: [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers
URL: https://github.com/apache/spark/pull/45269




Re: [PR] [WIP][SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #45269:
URL: https://github.com/apache/spark/pull/45269#discussion_r1513539612


##########
python/docs/source/development/debugging.rst:
##########
@@ -341,7 +372,12 @@ Python/Pandas UDF
 ~~~~~~~~~~~~~~~~~
 
 To use this on Python/Pandas UDFs, PySpark provides remote `Python Profilers <https://docs.python.org/3/library/profile.html>`_ for
-Python/Pandas UDFs, which can be enabled by setting ``spark.python.profile`` configuration to ``true``.
+Python/Pandas UDFs.
+
+SparkContext-based

Review Comment:
   I have a workaround [76e7387](https://github.com/apache/spark/pull/45269/commits/76e738768d591ff59e9f14210bbccecc9458896a) by using autoclass, but it doesn't look consistent with the rest of the page, as shown below.
   
   ![image](https://github.com/apache/spark/assets/47337188/ab2e8f7b-2d65-4788-af47-354bd66a6fa2)
   
   I'm wondering if we should have a follow-up designated for that part.
   





Re: [PR] [WIP] Documentation for SparkSession-based Profilers [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #45269:
URL: https://github.com/apache/spark/pull/45269#issuecomment-1969683363

   I was looking for the API doc. Thank you, @HyukjinKwon!




Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45269:
URL: https://github.com/apache/spark/pull/45269#discussion_r1506905947


##########
python/docs/source/development/debugging.rst:
##########
@@ -341,7 +372,12 @@ Python/Pandas UDF
 ~~~~~~~~~~~~~~~~~
 
 To use this on Python/Pandas UDFs, PySpark provides remote `Python Profilers <https://docs.python.org/3/library/profile.html>`_ for
-Python/Pandas UDFs, which can be enabled by setting ``spark.python.profile`` configuration to ``true``.
+Python/Pandas UDFs.
+
+SparkContext-based

Review Comment:
   I think you can just leave this as is, and add one additional section called "runtime profiler".



##########
python/docs/source/development/debugging.rst:
##########
@@ -341,7 +372,12 @@ Python/Pandas UDF
 ~~~~~~~~~~~~~~~~~
 
 To use this on Python/Pandas UDFs, PySpark provides remote `Python Profilers <https://docs.python.org/3/library/profile.html>`_ for
-Python/Pandas UDFs, which can be enabled by setting ``spark.python.profile`` configuration to ``true``.
+Python/Pandas UDFs.
+
+SparkContext-based

Review Comment:
   I think you can just remove this, and add one additional section called "runtime profiler".





Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #45269:
URL: https://github.com/apache/spark/pull/45269#discussion_r1508099936


##########
python/docs/source/development/debugging.rst:
##########
@@ -341,7 +372,12 @@ Python/Pandas UDF
 ~~~~~~~~~~~~~~~~~
 
 To use this on Python/Pandas UDFs, PySpark provides remote `Python Profilers <https://docs.python.org/3/library/profile.html>`_ for
-Python/Pandas UDFs, which can be enabled by setting ``spark.python.profile`` configuration to ``true``.
+Python/Pandas UDFs.
+
+SparkContext-based

Review Comment:
   How about putting the new doc first?
   
   - Identifying Hot Loops (Python Profilers)
       - Driver Side
       ...
       - Executor Side
           - Python/Pandas UDF
           Show the new profiler usage
           - Legacy (for RDD or non-Spark Connect)
           Put the current doc here





Re: [PR] [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #45269:
URL: https://github.com/apache/spark/pull/45269#issuecomment-1984848824

   Merged to master.




Re: [PR] [WIP][SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #45269:
URL: https://github.com/apache/spark/pull/45269#issuecomment-1979532670

   Marked as WIP to wait for https://github.com/apache/spark/pull/45378 to be merged first.

