Posted to user@spark.apache.org by Connor Zanin <cn...@udel.edu> on 2015/10/28 02:29:38 UTC

python.worker.memory parameter

Hi all,

I am running a simple word count job on a cluster of 4 nodes (24 cores per
node). I am varying two parameters in the configuration:
spark.python.worker.memory and the number of partitions in the RDD. My job
is written in Python.
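
For reference, here is a minimal sketch of the kind of job and configuration I
mean (the input/output paths, the memory value, and the partition count below
are placeholders, not my exact script):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("wordcount-sweep")
        .set("spark.python.worker.memory", "512m"))  # first parameter being varied
sc = SparkContext(conf=conf)

num_partitions = 96  # second parameter being varied: partitions in the RDD

counts = (sc.textFile("hdfs:///path/to/input", minPartitions=num_partitions)
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b, numPartitions=num_partitions))

counts.saveAsTextFile("hdfs:///path/to/output")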

I am observing a discontinuity in the run time of the job when
spark.python.worker.memory is increased past a threshold. Unfortunately, I
am having trouble understanding exactly what this parameter does inside
Spark and how it changes Spark's behavior to create this discontinuity.

The documentation describes this parameter as "Amount of memory to use per
python worker process during aggregation," but I find this vague (or perhaps
I do not know enough Spark terminology to understand what it means).

I have been pointed to the source code in the past, specifically the
shuffle.py file where _spill() appears.
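
My rough mental model of that code is something like the simplified sketch
below. This is not the actual shuffle.py implementation, just the pattern I
think _spill() is part of; the class, the item-count limit, and the file
handling here are made up for illustration:

import pickle
import tempfile

class ToyMerger(object):
    """Toy stand-in for an external aggregator that spills to disk."""

    def __init__(self, memory_limit_items):
        # Stand-in for a real memory check; the real code tracks bytes used.
        self.memory_limit_items = memory_limit_items
        self.data = {}          # in-memory partial aggregates
        self.spill_files = []   # partial aggregates pushed to disk

    def merge(self, items):
        for key, value in items:
            self.data[key] = self.data.get(key, 0) + value
            # Once the in-memory aggregate grows past the limit, spill it.
            if len(self.data) > self.memory_limit_items:
                self._spill()

    def _spill(self):
        # Write the current partial aggregates to disk, then start over in
        # memory; spilled files would get merged back in at the end.
        f = tempfile.NamedTemporaryFile(delete=False, suffix=".spill")
        pickle.dump(self.data, f)
        f.close()
        self.spill_files.append(f.name)
        self.data = {}

If that reading is roughly right, I would guess the run-time jump happens
around the point where the worker stops (or starts) spilling to disk, but I
would appreciate confirmation from someone who knows the internals.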

Can anyone explain how this parameter behaves or point me to more
descriptive documentation? Thanks!

-- 
Regards,

Connor Zanin
Computer Science
University of Delaware

Re: python.worker.memory parameter

Posted by Ted Yu <yu...@gmail.com>.
I found this parameter in python/pyspark/rdd.py

partitionBy() has some explanation of how it tries to reduce the amount of
data transferred to Java.
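
As a quick illustration (not code from Spark itself; the path and partition
count are placeholders), partitionBy() is the kind of call where that logic
kicks in. As I read rdd.py, PySpark builds the hash buckets on the Python
side and ships them to the JVM in chunks, and spark.python.worker.memory
feeds into how large those chunks can get before they are flushed. Details
may differ between Spark versions, so treat this as my reading, not a spec:

pairs = (sc.textFile("hdfs:///path/to/input")      # placeholder path
           .flatMap(lambda line: line.split())
           .map(lambda word: (word, 1)))

# Explicitly repartitioning a keyed RDD goes through that code path;
# the partition count here is just illustrative.
bucketed = pairs.partitionBy(96)
print(bucketed.getNumPartitions())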

FYI
