Posted to user@spark.apache.org by Zilvinas Saltys <zi...@verizonmedia.com.INVALID> on 2021/08/31 13:18:22 UTC

memory_and_disk persistence level algorithm

I'm curious if someone could provide a bit of deeper insight into how
the memory_and_disk_ser persistence level works.

I've noticed that if my cluster has 2.2 TB of memory and I set the
persistence level to memory_only_ser, Spark uses about 2 TB and the
Storage tab shows a 97-99% fraction cached (there's just not enough
memory to cache it all). The job obviously runs poorly.
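For reference, the persist call looks roughly like this (a simplified
sketch; the dataset and path are placeholders, not the real job):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("cache-test").getOrCreate()

    // Placeholder input; the real job reads a much larger table.
    val events = spark.read.parquet("/data/events")

    // Serialized, memory-only caching: partitions that don't fit are
    // dropped and recomputed rather than spilled to disk.
    events.persist(StorageLevel.MEMORY_ONLY_SER)
    events.count()  // materialize the cache so it shows up in the Storage tab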

My hope was that, since only a tiny fraction doesn't fit in memory, I
could spill those few partitions to disk by setting the persistence
level to memory_and_disk_ser.
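So the only change from the sketch above is the storage level:

    // Same job as above; partitions that don't fit in memory should now
    // spill to disk instead of being dropped.
    events.persist(StorageLevel.MEMORY_AND_DISK_SER)
    events.count()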

But what happens really surprises me. The job starts out well: 99% of
memory is used and 1-2% goes to disk. But as the job continues to run,
more and more is cached to disk. At some point 50% is cached to disk
and the memory seems to sit unused. The job runs just as poorly as it
does with the memory_only_ser persistence level.
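For what it's worth, this is roughly how I've been sanity-checking that
the executors still have storage memory free (assuming
getExecutorMemoryStatus reports what I think it does: max and remaining
memory available for caching, in bytes):

    // Print, per executor, the max memory available for caching and how
    // much of it is still free.
    spark.sparkContext.getExecutorMemoryStatus.foreach {
      case (executor, (maxMem, remainingMem)) =>
        println(f"$executor%-30s max=${maxMem / 1e9}%.1f GB free=${remainingMem / 1e9}%.1f GB")
    }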

The Spark version is 3.0.1, and I saw the same behaviour before Spark 3
as well. Does anyone have an understanding of what might be happening,
and why the proportion cached on disk keeps growing even when there is
free storage memory?

Thanks