Posted to issues@spark.apache.org by "sam (Jira)" <ji...@apache.org> on 2021/10/09 11:53:00 UTC

[jira] [Created] (SPARK-36966) Spark evicts RDD partitions instead of allowing OOM

sam created SPARK-36966:
---------------------------

             Summary: Spark evicts RDD partitions instead of allowing OOM
                 Key: SPARK-36966
                 URL: https://issues.apache.org/jira/browse/SPARK-36966
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4
            Reporter: sam


In the past, Spark jobs would fail with an OutOfMemoryError if an RDD could not fit into memory (when trying to cache with MEMORY_ONLY). These days, Spark instead evicts partitions from the cache and recomputes them from scratch.
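For reference, this is the caching pattern in question, as a minimal sketch (illustrative names, assuming an existing SparkContext `sc`):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY)
rdd.count()  // materialises the cache; partitions that do not fit may be evicted
```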

We have some jobs that cache an RDD and then traverse it 300 times. To know when we need to increase the memory on our cluster, we need to know when a job has run out of memory.

The new behaviour of Spark makes this difficult: rather than the job OOMing (as in the past), the job just takes forever (our surrounding logic eventually times it out). Diagnosing why the job failed is hard because it is not immediately obvious from the logs that the job has run out of memory (since no OOM is thrown); the only clue is scattered "evicted" log lines.

As a bit of a hack, we use accumulators with mapPartitionsWithIndex to count how many times each partition is traversed, then forcibly fail the job when this count reaches 2 or more (indicating an eviction). This hack does not work very well: it gives false positives whose root cause we have not yet understood. It does not appear to be speculative execution, nor are tasks failing (`egrep "task .* in stage [0-9]+\.1" | wc -l` returns 0).
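The hack looks roughly like the following sketch (illustrative names; `rdd` is the cached RDD). Because mapPartitionsWithIndex is a transformation, accumulator updates are re-applied when a partition is recomputed, which is exactly the signal we rely on — but it also means retried or speculative tasks can inflate the count:

```scala
// One accumulator counts total partition computations across all traversals.
val computeCount = sc.longAccumulator("partitionComputations")

val counted = rdd.mapPartitionsWithIndex { (_, it) =>
  computeCount.add(1)  // bumped once per computed partition; recomputation double-counts
  it
}

val numPartitions = rdd.getNumPartitions
for (i <- 1 to 300) {
  counted.count()
  // After i traversals of a fully cached RDD we expect exactly i * numPartitions
  // computations; anything more suggests a partition was evicted and recomputed.
  if (computeCount.value > i.toLong * numPartitions)
    throw new IllegalStateException(
      s"Partition recomputation detected after traversal $i: likely eviction " +
      "(note: retried or speculative tasks also inflate accumulators)")
}
```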

*Question 1*: Is there a way to disable this new behaviour and make Spark behave as it used to (i.e. simply fail with OOM)? I have looked through the Spark configuration and cannot find anything like "disable-eviction".

*Question 2*: Is there any method on the SparkSession or SparkContext that we can call to easily detect when eviction is happening?
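One avenue we have considered is polling the storage status between traversals. This is a best-effort sketch rather than a supported contract, since getRDDStorageInfo is a @DeveloperApi:

```scala
// After a traversal, check whether every cached RDD still has all of its
// partitions in the cache; a shortfall suggests eviction has occurred.
val fullyCached = sc.getRDDStorageInfo.forall { info =>
  info.numCachedPartitions == info.numPartitions
}
if (!fullyCached)
  println("WARN: some cached RDD partitions are missing (possible eviction)")
```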

If not, then for our use case this is effectively a regression: we need a way to make Spark behave predictably, or at least a way to determine automatically when Spark is running slowly due to lack of memory.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org