You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "xinrong-meng (via GitHub)" <gi...@apache.org> on 2024/03/05 01:17:46 UTC

[PR] [WIP] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

xinrong-meng opened a new pull request, #45378:
URL: https://github.com/apache/spark/pull/45378

   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.

xinrong-meng closed pull request #45378: [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling
URL: https://github.com/apache/spark/pull/45378


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.

ueshin commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1513759920


##########
python/pyspark/sql/profiler.py:
##########
@@ -224,6 +224,54 @@ def dump(id: int) -> None:
             for id in sorted(code_map.keys()):
                 dump(id)
 
+    def clear_perf_profiles(self, id: Optional[int] = None) -> None:
+        """
+        Clear the perf profile results.
+
+        .. versionadded:: 4.0.0
+
+        Parameters
+        ----------
+        id : int, optional
+            The UDF ID whose profiling results should be cleared.
+            If not specified, all the results will be cleared.
+        """
+        ids_to_remove = [
+            result_id
+            for result_id, (perf, _, *_) in self._profile_results.items()
+            if perf is not None
+        ]
+        with self._lock:
+            if id is not None:
+                if id in ids_to_remove:
+                    self._profile_results.pop(id, None)

Review Comment:
   Seems to be removing `memory` as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1513693941


##########
python/pyspark/sql/profiler.py:
##########
@@ -224,6 +224,54 @@ def dump(id: int) -> None:
             for id in sorted(code_map.keys()):
                 dump(id)
 
+    def clear_perf_profiles(self, id: Optional[int] = None) -> None:
+        """
+        Clear the perf profile results.
+
+        .. versionadded:: 4.0.0

Review Comment:
   Is this a user-facing API? If not, we don't need this version directive



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.

xinrong-meng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515212345


##########
python/pyspark/sql/profiler.py:
##########
@@ -224,6 +224,54 @@ def dump(id: int) -> None:
             for id in sorted(code_map.keys()):
                 dump(id)
 
+    def clear_perf_profiles(self, id: Optional[int] = None) -> None:
+        """
+        Clear the perf profile results.
+
+        .. versionadded:: 4.0.0
+
+        Parameters
+        ----------
+        id : int, optional
+            The UDF ID whose profiling results should be cleared.
+            If not specified, all the results will be cleared.
+        """
+        ids_to_remove = [
+            result_id
+            for result_id, (perf, _, *_) in self._profile_results.items()
+            if perf is not None
+        ]
+        with self._lock:
+            if id is not None:
+                if id in ids_to_remove:
+                    self._profile_results.pop(id, None)

Review Comment:
   Good catch! Thanks for the example. I adjusted the code and added tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.

ueshin commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515240180


##########
python/pyspark/sql/profiler.py:
##########
@@ -236,18 +236,22 @@ def clear_perf_profiles(self, id: Optional[int] = None) -> None:
             The UDF ID whose profiling results should be cleared.
             If not specified, all the results will be cleared.
         """
-        ids_to_remove = [
-            result_id
-            for result_id, (perf, _, *_) in self._profile_results.items()
-            if perf is not None
-        ]
         with self._lock:
             if id is not None:
-                if id in ids_to_remove:
-                    self._profile_results.pop(id, None)
+                if id in self._profile_results:
+                    perf, mem, *rest = self._profile_results[id]
+                    self._profile_results[id] = (None, mem, *rest)
+                    if mem is None:
+                        self._profile_results.pop(id, None)
             else:
-                for id_to_remove in ids_to_remove:
-                    self._profile_results.pop(id_to_remove, None)
+                ids_to_remove = []
+                for id, (perf, mem, *rest) in list(self._profile_results.items()):
+                    self._profile_results[id] = (None, mem, *rest)
+                    if mem is None:
+                        ids_to_remove.append(id)

Review Comment:
   nit: Can't we pop it here?



##########
python/pyspark/sql/profiler.py:
##########
@@ -262,15 +266,21 @@ def clear_memory_profiles(self, id: Optional[int] = None) -> None:
             If not specified, all the results will be cleared.
         """
         with self._lock:
-            ids_to_remove = [
-                id for id, (_, mem, *_) in self._profile_results.items() if mem is not None
-            ]
             if id is not None:
-                if id in ids_to_remove:
-                    self._profile_results.pop(id, None)
+                if id in self._profile_results:
+                    perf, mem, *rest = self._profile_results[id]
+                    self._profile_results[id] = (perf, None, *rest)
+                    if perf is None:
+                        self._profile_results.pop(id, None)
             else:
-                for id_to_remove in ids_to_remove:
-                    self._profile_results.pop(id_to_remove, None)
+                ids_to_remove = []
+                for id, (perf, mem, *rest) in list(self._profile_results.items()):
+                    self._profile_results[id] = (perf, None, *rest)
+                    if perf is None:
+                        ids_to_remove.append(id)

Review Comment:
   ditto.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.

xinrong-meng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1516752307


##########
python/pyspark/sql/tests/test_session.py:
##########
@@ -531,6 +531,33 @@ def test_dump_invalid_type(self):
             },
         )
 
+    def test_clear_memory_type(self):

Review Comment:
   Good idea!
   
   For now, all logic tested by SparkSessionProfileTests is directly imported in Spark Connect with no modification. But I do agree separating it later will improve readability and ensure future parity. I'll refactor later. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.

xinrong-meng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515363804


##########
python/pyspark/sql/profiler.py:
##########
@@ -236,18 +236,22 @@ def clear_perf_profiles(self, id: Optional[int] = None) -> None:
             The UDF ID whose profiling results should be cleared.
             If not specified, all the results will be cleared.
         """
-        ids_to_remove = [
-            result_id
-            for result_id, (perf, _, *_) in self._profile_results.items()
-            if perf is not None
-        ]
         with self._lock:
             if id is not None:
-                if id in ids_to_remove:
-                    self._profile_results.pop(id, None)
+                if id in self._profile_results:
+                    perf, mem, *rest = self._profile_results[id]
+                    self._profile_results[id] = (None, mem, *rest)
+                    if mem is None:
+                        self._profile_results.pop(id, None)
             else:
-                for id_to_remove in ids_to_remove:
-                    self._profile_results.pop(id_to_remove, None)
+                ids_to_remove = []
+                for id, (perf, mem, *rest) in list(self._profile_results.items()):
+                    self._profile_results[id] = (None, mem, *rest)
+                    if mem is None:
+                        ids_to_remove.append(id)

Review Comment:
   Good idea! Adjusted.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.

zhengruifeng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515475836


##########
python/pyspark/sql/tests/test_session.py:
##########
@@ -531,6 +531,33 @@ def test_dump_invalid_type(self):
             },
         )
 
+    def test_clear_memory_type(self):

Review Comment:
   nit, it seems we don't have a parity test for `test_session`. does it make sense to move `SparkSessionProfileTests` out of `test_session` and add parity test for it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.

xinrong-meng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1513714435


##########
python/pyspark/sql/profiler.py:
##########
@@ -224,6 +224,54 @@ def dump(id: int) -> None:
             for id in sorted(code_map.keys()):
                 dump(id)
 
+    def clear_perf_profiles(self, id: Optional[int] = None) -> None:
+        """
+        Clear the perf profile results.
+
+        .. versionadded:: 4.0.0

Review Comment:
   It is a user-facing API, along with `profile.show` and `profile.dump`. We will also add it to API doc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.

xinrong-meng commented on PR #45378:
URL: https://github.com/apache/spark/pull/45378#issuecomment-1984523232

   Merged to master, thank you all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.

xinrong-meng commented on PR #45378:
URL: https://github.com/apache/spark/pull/45378#issuecomment-1979697434

   Failed tests are irrelevant to changes proposed in this PR. Rerun failed tests https://github.com/xinrong-meng/spark/actions/runs/8162084262.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.

ueshin commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1515098224


##########
python/pyspark/sql/profiler.py:
##########
@@ -224,6 +224,54 @@ def dump(id: int) -> None:
             for id in sorted(code_map.keys()):
                 dump(id)
 
+    def clear_perf_profiles(self, id: Optional[int] = None) -> None:
+        """
+        Clear the perf profile results.
+
+        .. versionadded:: 4.0.0
+
+        Parameters
+        ----------
+        id : int, optional
+            The UDF ID whose profiling results should be cleared.
+            If not specified, all the results will be cleared.
+        """
+        ids_to_remove = [
+            result_id
+            for result_id, (perf, _, *_) in self._profile_results.items()
+            if perf is not None
+        ]
+        with self._lock:
+            if id is not None:
+                if id in ids_to_remove:
+                    self._profile_results.pop(id, None)

Review Comment:
   On Jupyter:
   
   ```py
   from pyspark.sql.functions import pandas_udf
   df = spark.range(3)
   
   @pandas_udf("long")
   def add1(x):
       return x + 1
   
   added = df.select(add1("id"))
   
   spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
   added.show()
   
   spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
   added.show()
   
   spark.profile.show()
   ...
   
   spark.profile.clear(type="memory")
   
   spark.profile.show()  # should still show the perf results?
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.

xinrong-meng commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1516752307


##########
python/pyspark/sql/tests/test_session.py:
##########
@@ -531,6 +531,33 @@ def test_dump_invalid_type(self):
             },
         )
 
+    def test_clear_memory_type(self):

Review Comment:
   Good idea!
   
   For now, all logic tested by SparkSessionProfileTests is directly imported in Spark Connect with no modification. But I do agree separating it later will improve readability. I'll refactor later. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47276][PYTHON][CONNECT] Introduce `spark.profile.clear` for SparkSession-based profiling [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.

ueshin commented on code in PR #45378:
URL: https://github.com/apache/spark/pull/45378#discussion_r1516750441


##########
python/pyspark/sql/profiler.py:
##########
@@ -224,6 +224,54 @@ def dump(id: int) -> None:
             for id in sorted(code_map.keys()):
                 dump(id)
 
+    def clear_perf_profiles(self, id: Optional[int] = None) -> None:
+        """
+        Clear the perf profile results.
+
+        .. versionadded:: 4.0.0

Review Comment:
   Actually this is not. The `clear` in `Profile` should be a user-facing API.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org