Posted to issues@spark.apache.org by "Xinrong Meng (Jira)" <ji...@apache.org> on 2022/10/05 00:43:00 UTC

[jira] [Updated] (SPARK-40281) Memory Profiler on Executors

     [ https://issues.apache.org/jira/browse/SPARK-40281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-40281:
---------------------------------
    Description: 
Profiling is critical to performance engineering. Memory consumption is a key indicator of how efficient a PySpark program is. There is an existing tool for memory profiling of Python programs, Memory Profiler (https://pypi.org/project/memory-profiler/).

PySpark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program. On the driver side, PySpark runs as a regular Python process; thus, we can profile it as a normal Python program using Memory Profiler.
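
For example, on the driver, Memory Profiler can be applied directly. The sketch below decorates a hypothetical driver-side helper ({{build_lookup}} is illustrative only, not part of PySpark):

{code:python}
# A minimal driver-side sketch using Memory Profiler (pip install memory-profiler).
# `build_lookup` is a hypothetical helper used only for illustration.
from memory_profiler import profile

@profile
def build_lookup(n):
    # Plain Python work on the driver; no executors are involved.
    return {i: str(i) for i in range(n)}

if __name__ == "__main__":
    build_lookup(1_000_000)  # prints line-by-line memory usage to stdout
{code}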

However, on the executor side, no such memory profiler is available. Since executors are distributed across different nodes in the cluster, per-executor profiles have to be collected and aggregated. Furthermore, Python worker processes are spawned per executor to execute Python/Pandas UDFs, which makes memory profiling more intricate.
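
To illustrate the gap, the sketch below runs a Pandas UDF: the function body executes inside Python worker processes on the executors, so a plain {{@profile}} decorator would report to each remote worker rather than to the driver, and the per-worker results would still need to be collected and merged:

{code:python}
# Sketch of the executor-side case: `plus_one` runs in Python worker
# processes spawned per executor, out of reach of a driver-local profiler.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("long")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1  # executes on executors, not on the driver

spark.range(10).select(plus_one("id")).show()
{code}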

This umbrella ticket proposes to implement a Memory Profiler on Executors.
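
One possible user-facing shape, by analogy with PySpark's existing performance profiler (the {{spark.python.profile}} conf plus {{SparkContext.show_profiles()}}), is sketched below. The configuration name {{spark.python.profile.memory}} is an assumption of this sketch, not a committed design:

{code:python}
# Hypothetical sketch only; the `spark.python.profile.memory` flag is an
# assumption made by analogy with the existing `spark.python.profile` conf.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.python.profile.memory", "true")  # assumed, not final
    .getOrCreate()
)

# ... run Python/Pandas UDF workloads here ...

# By analogy with the performance profiler, aggregated per-UDF memory
# profiles could then be dumped on the driver:
spark.sparkContext.show_profiles()
{code}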

  was:
Profiling is critical to performance engineering. Memory consumption is a key indicator of how efficient a PySpark program is. There is an existing tool for memory profiling of Python programs, Memory Profiler (https://pypi.org/project/memory-profiler/).

PySpark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program. On the driver side, PySpark runs as a regular Python process; thus, we can profile it as a normal Python program using Memory Profiler.

However, on the executor side, no such memory profiler is available. Since executors are distributed across different nodes in the cluster, per-executor profiles have to be collected and aggregated. Furthermore, Python worker processes are spawned per executor to execute Python/Pandas UDFs, which makes memory profiling more intricate.

This umbrella ticket proposes to implement a Memory Profiler on Executors.


> Memory Profiler on Executors
> ----------------------------
>
>                 Key: SPARK-40281
>                 URL: https://issues.apache.org/jira/browse/SPARK-40281
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Priority: Major
>
> Profiling is critical to performance engineering. Memory consumption is a key indicator of how efficient a PySpark program is. There is an existing tool for memory profiling of Python programs, Memory Profiler (https://pypi.org/project/memory-profiler/).
> PySpark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program. On the driver side, PySpark runs as a regular Python process; thus, we can profile it as a normal Python program using Memory Profiler.
> However, on the executor side, no such memory profiler is available. Since executors are distributed across different nodes in the cluster, per-executor profiles have to be collected and aggregated. Furthermore, Python worker processes are spawned per executor to execute Python/Pandas UDFs, which makes memory profiling more intricate.
> This umbrella ticket proposes to implement a Memory Profiler on Executors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org