Posted to issues@spark.apache.org by "Nick Poorman (JIRA)" <ji...@apache.org> on 2015/05/13 22:29:01 UTC

[jira] [Commented] (SPARK-1865) Improve behavior of cleanup of disk state

    [ https://issues.apache.org/jira/browse/SPARK-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542641#comment-14542641 ] 

Nick Poorman commented on SPARK-1865:
-------------------------------------

This issue was originally brought up here: https://github.com/spark-jobserver/spark-jobserver/issues/151#issuecomment-101512376

I'm running an MLlib ALS train job continuously, over and over again, on the same context, using spark-jobserver on Mesos with spark.mesos.coarse = true. The idea is to keep the model as up to date as possible while still being able to call the prediction job on that context with the trained model. I simply save the user and product features as Named RDDs in the train job and load them up in the predict jobs.
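
For context, the train job looks roughly like the sketch below. This is only an illustrative sketch, not the actual job code: the object name, input path, and ALS parameters are made up, and it assumes spark-jobserver's SparkJob / NamedRddSupport API and MLlib's ALS.

// Rough sketch of the train job (illustrative names, not the actual code).
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

object TrainJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // Hypothetical ratings input; the real job reads from elsewhere.
    val ratings = sc.textFile("hdfs:///ratings").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    // Retrain the model on the same long-lived context (roughly every three minutes).
    val model = ALS.train(ratings, /* rank */ 10, /* iterations */ 10, /* lambda */ 0.01)

    // Publish the factor matrices as named RDDs so the predict job can pick them up.
    namedRdds.update("user-features", model.userFeatures)
    namedRdds.update("product-features", model.productFeatures)
  }
}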

Every time the training occurs (about every three minutes), the amount of disk space being used keeps increasing until the disk runs out. I've narrowed it down to files like these, which are not being removed and are therefore causing the problem:

$ find /tmp -name \* | xargs du -hs

1.4G  /tmp/spark-e6844d42-6950-494a-bb59-95ebca0f93bb/blockmgr-1cd15ddf-def9-48c5-8e89-bfe1d81d1352/09
745M  /tmp/spark-e6844d42-6950-494a-bb59-95ebca0f93bb/blockmgr-1cd15ddf-def9-48c5-8e89-bfe1d81d1352/3b
317M  /tmp/spark-e6844d42-6950-494a-bb59-95ebca0f93bb/blockmgr-1cd15ddf-def9-48c5-8e89-bfe1d81d1352/05
856M  /tmp/spark-e6844d42-6950-494a-bb59-95ebca0f93bb/blockmgr-1cd15ddf-def9-48c5-8e89-bfe1d81d1352/1a
797M  /tmp/spark-e6844d42-6950-494a-bb59-95ebca0f93bb/blockmgr-1cd15ddf-def9-48c5-8e89-bfe1d81d1352/26
2.0M  /tmp/spark-e6844d42-6950-494a-bb59-95ebca0f93bb/blockmgr-1cd15ddf-def9-48c5-8e89-bfe1d81d1352/26/shuffle_65_29_0.data
2.1M  /tmp/spark-e6844d42-6950-494a-bb59-95ebca0f93bb/blockmgr-1cd15ddf-def9-48c5-8e89-bfe1d81d1352/26/shuffle_61_47_0.data
4.0K  /tmp/spark-e6844d42-6950-494a-bb59-95ebca0f93bb/blockmgr-1cd15ddf-def9-48c5-8e89-bfe1d81d1352/26/shuffle_17_31_0.index
2.2M  /tmp/spark-e6844d42-6950-494a-bb59-95ebca0f93bb/blockmgr-1cd15ddf-def9-48c5-8e89-bfe1d81d1352/26/shuffle_49_5_0.data
4.0K  /tmp/spark-e6844d42-6950-494a-bb59-95ebca0f93bb/blockmgr-1cd15ddf-def9-48c5-8e89-bfe1d81d1352/26/shuffle_82_17_0.index


I'll end up with a few thousand of those, which quickly eat up a few hundred GB of disk space until the disk is completely full.

These Mesos slave GC flags didn't help:
 --gc_delay=20mins --gc_disk_headroom=0.6

This didn't help either. I was thinking maybe I could disable spilling for the shuffle and avoid creating those files, but they were still created on the slaves:
spark.shuffle.spill = false

This didn't help either:
spark.cleaner.ttl = 600
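
For reference, the two Spark properties above were applied to the context roughly like this (a minimal sketch; with spark-jobserver they are normally passed in the context configuration rather than set in code, and the app name here is made up):

// Minimal sketch of the Spark-side settings that were tried (values as above).
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("als-train-context")      // hypothetical name
  .set("spark.mesos.coarse", "true")
  .set("spark.shuffle.spill", "false")  // did not stop the shuffle files
  .set("spark.cleaner.ttl", "600")      // did not help either

val sc = new SparkContext(conf)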

I'm at a complete loss here. Because the context never ends, training the model over and over again never seems to reap the shuffle files. I would think this would be a huge issue for Spark Streaming as well.
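
My assumption (which may well be wrong) is that the ContextCleaner only deletes shuffle files once the driver-side objects referencing them have been garbage collected, which on a long-lived context may simply never happen. The sketch below, continuing the TrainJob sketch above, is just that assumption written out, not a verified fix:

// Continuing runJob above. Assumption, not a verified fix: make the previous run's
// factors unreachable and nudge the driver GC so the ContextCleaner's weak
// references get a chance to fire for the old shuffles.
namedRdds.get[(Int, Array[Double])]("user-features").foreach(_.unpersist())
namedRdds.get[(Int, Array[Double])]("product-features").foreach(_.unpersist())

namedRdds.update("user-features", model.userFeatures)
namedRdds.update("product-features", model.productFeatures)

System.gc()  // prompt the driver JVM to collect the now-unreferenced RDDs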

> Improve behavior of cleanup of disk state
> -----------------------------------------
>
>                 Key: SPARK-1865
>                 URL: https://issues.apache.org/jira/browse/SPARK-1865
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy, Spark Core
>            Reporter: Aaron Davidson
>
> Right now the behavior of disk cleanup is centered around the exit hook of the executor, which attempts to cleanup shuffle files and disk manager blocks, but may fail. We should make this behavior more predictable, perhaps by letting the Standalone Worker cleanup the disk state, and adding a flag to disable having the executor cleanup its own state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org