Posted to issues@spark.apache.org by "Jason White (JIRA)" <ji...@apache.org> on 2015/06/18 23:29:01 UTC

[jira] [Commented] (SPARK-8453) Unioning two RDDs in PySpark doesn't spill to disk

    [ https://issues.apache.org/jira/browse/SPARK-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592541#comment-14592541 ] 

Jason White commented on SPARK-8453:
------------------------------------

Interestingly, if you repartition the RDDs to the same number of partitions before unioning them, there are no memory spikes at all:
```
profiler = sc.textFile('/user/jasonwhite/profiler')
profiler = profiler.repartition(profiler.getNumPartitions())
profiler_2 = sc.textFile('/user/jasonwhite/profiler')
profiler_2 = profiler_2.repartition(profiler_2.getNumPartitions())
z = profiler.union(profiler_2)
z.count()
```
Obviously I'd rather not repartition every dataset right after loading it via `sc.textFile`. Any ideas as to what the issue could be?
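Until the root cause is clear, a small wrapper could at least keep the workaround in one place. This is only a sketch; `load_text_repartitioned` is a hypothetical helper, not part of Spark:
```
def load_text_repartitioned(sc, path):
    # Hypothetical helper (not a Spark API): load a text file and immediately
    # apply the repartition workaround shown above.
    rdd = sc.textFile(path)
    # Repartitioning to the RDD's existing partition count forces a shuffle,
    # which is what avoided the memory spike in the test above.
    return rdd.repartition(rdd.getNumPartitions())

profiler = load_text_repartitioned(sc, '/user/jasonwhite/profiler')
profiler_2 = load_text_repartitioned(sc, '/user/jasonwhite/profiler')
z = profiler.union(profiler_2)
z.count()
```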

> Unioning two RDDs in PySpark doesn't spill to disk
> --------------------------------------------------
>
>                 Key: SPARK-8453
>                 URL: https://issues.apache.org/jira/browse/SPARK-8453
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.4.0
>            Reporter: Jason White
>              Labels: memory, union
>
> When unioning two RDDs in PySpark, spill limits do not seem to be recognized, and our YARN containers are frequently killed for exceeding their memory limits as a result.
> I have been able to reproduce this in the following simple scenario:
> - spark.executor.instances: 1, spark.executor.memory: 512m, spark.executor.cores: 20, spark.python.worker.reuse: false, spark.shuffle.spill: true, spark.yarn.executor.memoryOverhead: 5000
> (I recognize this is not a good setup - I set things up this way to explore this problem and make the symptom easier to isolate)
> I have a 1-billion-row dataset, split up evenly into 1000 partitions. Each partition contains exactly 1 million rows. Each row contains approximately 250 characters, +/- 10.
> I executed the following in a PySpark shell:
> ```
> profiler = sc.textFile('/user/jasonwhite/profiler')
> profiler_2 = sc.textFile('/user/jasonwhite/profiler')
> profiler.count()
> profiler_2.count()
> ```
> Total container memory utilization stayed between 2500 and 2800 MB over the course of execution, with no spill. No problem.
> Then I executed:
> ```
> z = profiler.union(profiler_2)
> z.count()
> ```
> Memory utilization spiked immediately to between 4700 and 4900 MB over the course of execution, also with no spill. Big problem. Since we size our containers based in part on the Python spill limit, our containers are unexpectedly killed when that limit is not respected.
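For reference, here is how the configuration listed in the description might be expressed when building the context programmatically. This is only a sketch: the properties are copied from the description, but the SparkConf form and the `spark.python.worker.memory` line (assuming that is the "Python spill limit" referred to above) are assumptions:
```
from pyspark import SparkConf, SparkContext

# Sketch only: properties copied from the bug description; the final .set()
# is an assumption about which setting acts as the Python spill limit.
conf = (SparkConf()
        .set("spark.executor.instances", "1")
        .set("spark.executor.memory", "512m")
        .set("spark.executor.cores", "20")
        .set("spark.python.worker.reuse", "false")
        .set("spark.shuffle.spill", "true")
        .set("spark.yarn.executor.memoryOverhead", "5000")
        .set("spark.python.worker.memory", "512m"))  # assumed spill threshold

sc = SparkContext(conf=conf)
```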


