Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2015/12/16 01:03:46 UTC

[jira] [Assigned] (SPARK-12330) Mesos coarse executor does not cleanup blockmgr properly on termination if data is stored on disk

     [ https://issues.apache.org/jira/browse/SPARK-12330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12330:
------------------------------------

    Assignee: Apache Spark

> Mesos coarse executor does not cleanup blockmgr properly on termination if data is stored on disk
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12330
>                 URL: https://issues.apache.org/jira/browse/SPARK-12330
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Mesos
>    Affects Versions: 1.5.1
>            Reporter: Charles Allen
>            Assignee: Apache Spark
>
> A simple line-count example can be launched like so:
> {code}
> SPARK_HOME=/mnt/tmp/spark MASTER=mesos://zk://zk.metamx-prod.com:2181/mesos-druid/metrics ./bin/spark-shell \
>   --conf spark.mesos.coarse=true \
>   --conf spark.cores.max=7 \
>   --conf spark.mesos.executor.memoryOverhead=2048 \
>   --conf spark.mesos.executor.home=/mnt/tmp/spark \
>   --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution -XX:+PrintFlagsFinal -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=1024m -verbose:gc -XX:+PrintFlagsFinal -Djava.io.tmpdir=/mnt/tmp/scratch' \
>   --conf spark.hadoop.fs.s3n.awsAccessKeyId='REDACTED' \
>   --conf spark.hadoop.fs.s3n.awsSecretAccessKey='REDACTED' \
>   --conf spark.executor.memory=7g \
>   --conf spark.executorEnv.GLOG_v=9 \
>   --conf spark.storage.memoryFraction=0.0 \
>   --conf spark.shuffle.memoryFraction=0.0
> {code}
> In the shell the following lines can be executed:
> {code}
> val text_file = sc.textFile("s3n://REDACTED/charlesallen/tpch/lineitem.tbl").persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
> text_file.map(l => 1).sum
> {code}
> which will result in
> {code}
> res0: Double = 6001215.0
> {code}
> for the TPC-H 1GB dataset.
> Unfortunately, the blockmgr directory remains on the executor node after the Spark context terminates.
> The executor log looks like this near termination:
> {code}
> I1215 02:12:31.190878 130732 process.cpp:566] Parsed message name 'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.30:58604 from slave(1)@172.19.67.30:5051
> I1215 02:12:31.190928 130732 process.cpp:2382] Spawned process __http__(4)@172.19.67.30:58604
> I1215 02:12:31.190932 130721 process.cpp:2392] Resuming executor(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.190924800+00:00
> I1215 02:12:31.190958 130702 process.cpp:2392] Resuming __http__(4)@172.19.67.30:58604 at 2015-12-15 02:12:31.190951936+00:00
> I1215 02:12:31.190976 130721 exec.cpp:381] Executor asked to shutdown
> I1215 02:12:31.190943 130727 process.cpp:2392] Resuming __gc__@172.19.67.30:58604 at 2015-12-15 02:12:31.190937088+00:00
> I1215 02:12:31.190991 130702 process.cpp:2497] Cleaning up __http__(4)@172.19.67.30:58604
> I1215 02:12:31.191032 130721 process.cpp:2382] Spawned process (2)@172.19.67.30:58604
> I1215 02:12:31.191040 130702 process.cpp:2392] Resuming (2)@172.19.67.30:58604 at 2015-12-15 02:12:31.191037952+00:00
> I1215 02:12:31.191054 130702 exec.cpp:80] Scheduling shutdown of the executor
> I1215 02:12:31.191069 130721 exec.cpp:396] Executor::shutdown took 21572ns
> I1215 02:12:31.191073 130702 clock.cpp:260] Created a timer for (2)@172.19.67.30:58604 in 5secs in the future (2015-12-15 02:12:36.191062016+00:00)
> I1215 02:12:31.191066 130720 process.cpp:2392] Resuming (1)@172.19.67.30:58604 at 2015-12-15 02:12:31.191059200+00:00
> 15/12/15 02:12:31 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
> I1215 02:12:31.240103 130732 clock.cpp:151] Handling timers up to 2015-12-15 02:12:31.240091136+00:00
> I1215 02:12:31.240123 130732 clock.cpp:158] Have timeout(s) at 2015-12-15 02:12:31.240036096+00:00
> I1215 02:12:31.240183 130730 process.cpp:2392] Resuming reaper(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.240178176+00:00
> I1215 02:12:31.240226 130730 clock.cpp:260] Created a timer for reaper(1)@172.19.67.30:58604 in 100ms in the future (2015-12-15 02:12:31.340212992+00:00)
> I1215 02:12:31.247019 130720 clock.cpp:260] Created a timer for (1)@172.19.67.30:58604 in 3secs in the future (2015-12-15 02:12:34.247005952+00:00)
> 15/12/15 02:12:31 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
> 15/12/15 02:12:31 INFO ShutdownHookManager: Shutdown hook called
> ....no more java logs
> {code}
> If the shuffle fraction is NOT set to 0.0, and the data is allowed to stay in memory, then the following log can be seen at termination instead:
> {code}
> I1215 01:19:16.247705 120052 process.cpp:566] Parsed message name 'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.24:60016 from slave(1)@172.19.67.24:5051
> I1215 01:19:16.247745 120052 process.cpp:2382] Spawned process __http__(4)@172.19.67.24:60016
> I1215 01:19:16.247747 120034 process.cpp:2392] Resuming executor(1)@172.19.67.24:60016 at 2015-12-15 01:19:16.247741952+00:00
> I1215 01:19:16.247758 120030 process.cpp:2392] Resuming __gc__@172.19.67.24:60016 at 2015-12-15 01:19:16.247755008+00:00
> I1215 01:19:16.247772 120034 exec.cpp:381] Executor asked to shutdown
> I1215 01:19:16.247772 120038 process.cpp:2392] Resuming __http__(4)@172.19.67.24:60016 at 2015-12-15 01:19:16.247767808+00:00
> I1215 01:19:16.247791 120038 process.cpp:2497] Cleaning up __http__(4)@172.19.67.24:60016
> I1215 01:19:16.247814 120034 process.cpp:2382] Spawned process (2)@172.19.67.24:60016
> I1215 01:19:16.247823 120038 process.cpp:2392] Resuming (2)@172.19.67.24:60016 at 2015-12-15 01:19:16.247822080+00:00
> I1215 01:19:16.247836 120038 exec.cpp:80] Scheduling shutdown of the executor
> I1215 01:19:16.247820 120039 process.cpp:2392] Resuming __gc__@172.19.67.24:60016 at 2015-12-15 01:19:16.247816960+00:00
> I1215 01:19:16.247843 120034 exec.cpp:396] Executor::shutdown took 17355ns
> I1215 01:19:16.247849 120038 clock.cpp:260] Created a timer for (2)@172.19.67.24:60016 in 5secs in the future (2015-12-15 01:19:21.247843072+00:00)
> I1215 01:19:16.247855 120049 process.cpp:2392] Resuming (1)@172.19.67.24:60016 at 2015-12-15 01:19:16.247849984+00:00
> 15/12/15 01:19:16 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
> 15/12/15 01:19:16 INFO MemoryStore: MemoryStore cleared
> 15/12/15 01:19:16 INFO BlockManager: BlockManager stopped
> 15/12/15 01:19:16 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
> 15/12/15 01:19:16 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
> 15/12/15 01:19:16 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
> 15/12/15 01:19:16 INFO ShutdownHookManager: Shutdown hook called
> .... no more java logs
> {code}
> Notice that in the former case, where dirty directories are left behind, the executor intercepts a SIGTERM and there are no log messages indicating that any disk cleanup occurred. In the latter case, the shutdown hooks for the memory and block managers are seen running.
> This is using Mesos 0.25.0.
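The cleanup that fails to happen here is, presumably, a JVM shutdown hook removing the block manager's on-disk directories. A minimal stand-alone sketch of that pattern (not Spark's actual code; the directory and file names below are made up for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BlockDirCleanupDemo {
    public static void main(String[] args) throws IOException {
        // Stand-in for an executor's blockmgr-* scratch directory (hypothetical names).
        Path dir = Files.createTempDirectory("blockmgr-demo-");
        Path block = Files.createFile(dir.resolve("rdd_0_0"));

        // The disk store relies on a JVM shutdown hook of roughly this shape to
        // remove its local directories. If the process dies before the hook runs
        // to completion (e.g. the slave escalating SIGTERM to SIGKILL after its
        // grace period), the directory survives on the node.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                Files.deleteIfExists(block);
                Files.deleteIfExists(dir);
                System.out.println("cleaned: " + Files.notExists(dir));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }));

        System.out.println("created: " + Files.exists(dir));
        // On a normal exit the hook runs and the directory is removed.
    }
}
```

On a clean exit this prints "created: true" followed by "cleaned: true"; a SIGKILL between process start and hook completion would leave the directory behind, which matches the disk-backed log above where only "Shutdown hook called" appears and no BlockManager cleanup is logged.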



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
