Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/01/24 03:31:00 UTC

[jira] [Comment Edited] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory

    [ https://issues.apache.org/jira/browse/SPARK-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750594#comment-16750594 ] 

Hyukjin Kwon edited comment on SPARK-26679 at 1/24/19 3:30 AM:
---------------------------------------------------------------

{quote}
There are two extreme cases: (1) an app which does a ton of stuff in python and uses a lot of python memory from user, but no usage of the sort machinery and (2) an app which uses the sort machinery within python, but makes very little use of allocating memory from user code. 
{quote}

Yes, but I was wondering whether we already have Spark configurations covering both cases, for instance somewhere in core (correct me if I am mistaken). The current sort machinery in Python appears to check the given limit against the memory used by the whole worker process: for instance, if that limit is set to 500MB and Python already uses 300MB for other purposes, the sorter gets roughly 200MB before it spills. So I guess it doesn't matter how much memory the user code itself uses. The configuration {{spark.python.worker.memory}} looks like it was initially introduced with the same target in mind, but for a different purpose.
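
For reference, here is a minimal sketch of that spill behaviour, assuming a hypothetical record buffer and a threshold analogous to {{spark.python.worker.memory}}; the real logic lives in {{pyspark/shuffle.py}} and differs in detail:

{code:python}
import pickle
import resource
import tempfile

def process_memory_mb():
    # Approximate memory of the whole worker process, in MB
    # (on Linux, ru_maxrss is reported in kilobytes).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def add_record(buffer, record, spill_threshold_mb=500):
    # Hypothetical sketch: the threshold is checked against the memory of
    # the *entire* process, so if user code already holds 300MB, sorting
    # effectively gets only ~200MB of a 500MB threshold before spilling.
    buffer.append(record)
    if process_memory_mb() > spill_threshold_mb:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".spill") as f:
            pickle.dump(buffer, f)
        buffer.clear()
        return f.name  # path of the spill file
    return None
{code}

The point is just that the limit is shared: other Python allocations count against the same number the sorter checks.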




> Deconflict spark.executor.pyspark.memory and spark.python.worker.memory
> -----------------------------------------------------------------------
>
>                 Key: SPARK-26679
>                 URL: https://issues.apache.org/jira/browse/SPARK-26679
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Ryan Blue
>            Priority: Major
>
> In 2.4.0, spark.executor.pyspark.memory was added to limit the total memory space of a Python worker. There is another RDD setting, spark.python.worker.memory, that controls when Spark decides to spill data to disk. These are currently similar, but not related to one another.
> PySpark should probably use spark.executor.pyspark.memory to limit or default the setting of spark.python.worker.memory, because the latter property controls spilling and should be lower than the total memory limit. Renaming spark.python.worker.memory would also help clarity, because it sounds like it should control the limit but is actually more like the JVM setting spark.memory.fraction.
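
To make the relationship concrete, a sketch of how the two settings might be combined today, with the spill threshold kept below the hard cap (the values here are illustrative only, not recommendations):

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pyspark-memory-example")
    # Hard limit on Python worker memory per executor (added in 2.4.0).
    .config("spark.executor.pyspark.memory", "2g")
    # Spill threshold used by the RDD sort/aggregation machinery; kept
    # below the hard limit so spilling kicks in before the cap is hit.
    .config("spark.python.worker.memory", "512m")
    .getOrCreate()
)
{code}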


