Posted to user@spark.apache.org by Jeff Frylings <je...@oracle.com> on 2018/05/30 15:49:44 UTC

Blockmgr directories intermittently not being cleaned up

Intermittently, on Spark executors, we are seeing blockmgr directories that are not cleaned up after execution, and the leftover data is filling up the disk.  These executors use Mesos dynamic resource allocation, and no single app using an executor appears to be the culprit.  Sometimes an app will run and be cleaned up, and then on a subsequent run the same AppExecId will run and not be cleaned up.  The runs that left folders behind did not have any obvious task failures in the Spark UI during that time frame.

The Spark shuffle service in the AMI is version 2.1.1.
The code is running on Spark 2.0.2 in the Mesos sandbox.

In a case where the files are cleaned up, the spark.log looks like the following:
18/05/28 14:47:24 INFO ExternalShuffleBlockResolver: Registered executor AppExecId{appId=33d8fe79-a670-4277-b6f3-ee1049724204-8310, execId=95} with ExecutorShuffleInfo{localDirs=[/mnt/blockmgr-b2c7ff97-481e-4482-b9ca-92a5f8d4b25e], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
...
18/05/29 02:54:09 INFO MesosExternalShuffleBlockHandler: Application 33d8fe79-a670-4277-b6f3-ee1049724204-8310 timed out. Removing shuffle files.
18/05/29 02:54:09 INFO ExternalShuffleBlockResolver: Application 33d8fe79-a670-4277-b6f3-ee1049724204-8310 removed, cleanupLocalDirs = true


In a case where the files are not cleaned up, we do not see the "MesosExternalShuffleBlockHandler: Application <appId> timed out. Removing shuffle files." message.

We are passing "--conf spark.worker.cleanup.enabled=true" when starting the job, but I believe this setting only pertains to standalone mode, and we are using the Mesos deployment mode, so I don't think the flag actually does anything.
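For reference, the full family of standalone-mode cleanup settings looks like the following; this is a sketch based on the Spark configuration docs, shown mainly to illustrate that these keys are read by the standalone Worker process and therefore would be ignored on a Mesos deployment (the interval and TTL values are the documented defaults, and the class/jar names are placeholders):

```shell
# Standalone mode only: these spark.worker.cleanup.* keys are read by the
# standalone Worker, so on Mesos they have no effect.
spark-submit \
  --conf spark.worker.cleanup.enabled=true \
  --conf spark.worker.cleanup.interval=1800 \
  --conf spark.worker.cleanup.appDataTtl=604800 \
  --class com.example.MyJob myjob.jar
```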


Thanks,
Jeff
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Blockmgr directories intermittently not being cleaned up

Posted by tBoyle <th...@gmail.com>.
I'm experiencing the same behaviour, with shuffle data being orphaned on disk
(Spark 2.0.1 with Spark Streaming).

We are using AWS R4 EC2 instances with 300 GB EBS volumes attached; most
spilled shuffle data is eventually cleaned up by the ContextCleaner within
10 minutes. We do not use the external shuffle service, and we also run on Mesos.

Occasionally some shuffle files are never removed until the application is
gracefully shut down or dies due to lack of disk space. I am confident the
orphaned shuffle data is not in use by any jobs after 5 minutes (the batch
duration). Do you know of any possible causes of this shuffle data being
left orphaned on disk instead of cleaned up?
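One thing that may be worth checking in the ContextCleaner case: it only removes shuffle files after the driver garbage-collects the corresponding shuffle references, so if driver GC runs rarely (for example with a large, mostly idle driver heap), cleanup can lag far behind. Spark exposes spark.cleaner.periodicGC.interval (default 30min) to force a periodic GC on the driver for exactly this reason; a sketch of lowering it (the 10min value is illustrative, not a recommendation):

```shell
# Forces a periodic System.gc() on the driver so the ContextCleaner can
# notice unreferenced shuffles sooner. 10min is an illustrative value.
spark-submit \
  --conf spark.cleaner.periodicGC.interval=10min \
  ...
```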






Re: Blockmgr directories intermittently not being cleaned up

Posted by Jeff Frylings <je...@oracle.com>.
The logs are not the problem; it is the shuffle files that are not being cleaned up.  We do have the configs for log rolling, and that is working just fine.

ex: /mnt/blockmgr-d65d4a74-d59a-4a06-af93-ba29232f7c5b/31/shuffle_1_46_0.data
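Not a fix for the root cause, but as a stopgap, a periodic sweep of stale blockmgr directories can keep the disk from filling. This is a hypothetical sketch: the /mnt path matches the logs above, the one-day age threshold is an assumption, and you would want to confirm that no live executor still references a directory before deleting it.

```shell
#!/bin/sh
# Remove blockmgr-* directories not modified in the last 24 hours.
# The path and age threshold are assumptions; adjust for your deployment,
# and verify no running executor still uses a directory before deleting.
find /mnt -maxdepth 1 -type d -name 'blockmgr-*' -mmin +1440 \
  -exec rm -rf {} +
```

This could run from cron on each agent host; -maxdepth 1 keeps find from descending into the subdirectories it is about to delete.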

> On May 30, 2018, at 9:54 AM, Ajay <aj...@gmail.com> wrote:
> 
> I have used these configs in the past to clean up the executor logs.
> 
>       .set("spark.executor.logs.rolling.time.interval", "minutely")
>       .set("spark.executor.logs.rolling.strategy", "time")
>       .set("spark.executor.logs.rolling.maxRetainedFiles", "1")
> 


Re: Blockmgr directories intermittently not being cleaned up

Posted by Ajay <aj...@gmail.com>.
I have used these configs in the past to clean up the executor logs.

      .set("spark.executor.logs.rolling.time.interval", "minutely")
      .set("spark.executor.logs.rolling.strategy", "time")
      .set("spark.executor.logs.rolling.maxRetainedFiles", "1")


-- 
Thanks,
Ajay