Posted to issues@flink.apache.org by "Maximilian Michels (JIRA)" <ji...@apache.org> on 2016/09/24 12:23:20 UTC

[jira] [Comment Edited] (FLINK-4485) Finished jobs in yarn session fill /tmp filesystem

    [ https://issues.apache.org/jira/browse/FLINK-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15518937#comment-15518937 ] 

Maximilian Michels edited comment on FLINK-4485 at 9/24/16 12:22 PM:
---------------------------------------------------------------------

master: 4a8e94403fb48318561a3cf2da57ba9da280949e
release-1.1: 62c666f5794fa211bf570874b1b77044fd6840ac


was (Author: mxm):
master: cd43dd59e248766627c35f90038b2202ed9e52dc
release-1.1: 06496439a845820ff190edcf09d2ebbd28b0f0a5

> Finished jobs in yarn session fill /tmp filesystem
> --------------------------------------------------
>
>                 Key: FLINK-4485
>                 URL: https://issues.apache.org/jira/browse/FLINK-4485
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 1.1.0
>            Reporter: Niels Basjes
>            Assignee: Maximilian Michels
>            Priority: Blocker
>             Fix For: 1.2.0, 1.1.3
>
>
> On a Yarn cluster I start a yarn-session with a few containers and task slots.
> Then I submit a large number of Flink batch jobs in sequence against this yarn session. Each is the exact same job (Java code) but with different parameters.
> In this scenario the job exports HBase tables to files in HDFS; the parameters select which data to export from which tables and name the target directory.
> After running several dozen jobs, job submission started to fail and we investigated.
> The cause was that the /tmp file system on the Yarn node hosting the jobmanager was full (4GB, 100% used).
> However, the output of {{du -hcs /tmp}} showed only 200MB in use.
> We found that a very large file (we guess it is the jar of the job) was put in /tmp, used, and deleted, yet the jobmanager never closed its file handle.
> As soon as we killed the jobmanager the disk space was freed.
> In summary, a yarn-session that receives enough jobs brings down the Yarn node for all users.
> See parts of the output we got from {{lsof}} below.
> {code}
> COMMAND     PID      USER   FD      TYPE             DEVICE      SIZE       NODE NAME
> java      15034   nbasjes  550r      REG             253,17  66219695        245 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000003 (deleted)
> java      15034   nbasjes  551r      REG             253,17  66219695        252 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000007 (deleted)
> java      15034   nbasjes  552r      REG             253,17  66219695        267 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000012 (deleted)
> java      15034   nbasjes  553r      REG             253,17  66219695        250 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000005 (deleted)
> java      15034   nbasjes  554r      REG             253,17  66219695        288 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000018 (deleted)
> java      15034   nbasjes  555r      REG             253,17  66219695        298 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000025 (deleted)
> java      15034   nbasjes  557r      REG             253,17  66219695        254 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000008 (deleted)
> java      15034   nbasjes  558r      REG             253,17  66219695        292 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000019 (deleted)
> java      15034   nbasjes  559r      REG             253,17  66219695        275 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000013 (deleted)
> java      15034   nbasjes  560r      REG             253,17  66219695        159 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000002 (deleted)
> java      15034   nbasjes  562r      REG             253,17  66219695        238 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000001 (deleted)
> java      15034   nbasjes  568r      REG             253,17  66219695        246 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000004 (deleted)
> java      15034   nbasjes  569r      REG             253,17  66219695        255 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000009 (deleted)
> java      15034   nbasjes  571r      REG             253,17  66219695        299 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000026 (deleted)
> java      15034   nbasjes  572r      REG             253,17  66219695        293 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000020 (deleted)
> java      15034   nbasjes  574r      REG             253,17  66219695        256 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000010 (deleted)
> java      15034   nbasjes  575r      REG             253,17  66219695        302 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000029 (deleted)
> java      15034   nbasjes  576r      REG             253,17  66219695        294 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000021 (deleted)
> java      15034   nbasjes  577r      REG             253,17  66219695        262 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000011 (deleted)
> java      15034   nbasjes  578r      REG             253,17  66219695        251 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000006 (deleted)
> java      15034   nbasjes  580r      REG             253,17  66219695        295 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000022 (deleted)
> java      15034   nbasjes  581r      REG             253,17  66219695        300 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000027 (deleted)
> java      15034   nbasjes  582r      REG             253,17  66219695        188 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/cache/blob_e318d1698aa6e7dc91e5f4a9f8ba29781aebd8c4 (deleted)
> java      15034   nbasjes  585r      REG             253,17  66219695        279 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000014 (deleted)
> java      15034   nbasjes  586r      REG             253,17  66219695        296 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000023 (deleted)
> java      15034   nbasjes  588r      REG             253,17  66219695        301 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000028 (deleted)
> java      15034   nbasjes  589r      REG             253,17  66219695        297 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000024 (deleted)
> java      15034   nbasjes  598r      REG             253,17  66219695        280 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000015 (deleted)
> java      15034   nbasjes  601r      REG             253,17  66219695        289 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000016 (deleted)
> java      15034   nbasjes  604r      REG             253,17  66219695        284 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000017 (deleted)
> {code}
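
The behaviour described above (deleted files that still occupy space until the owning process closes them or exits) can be reproduced outside Flink. The following is a minimal sketch, not Flink's BlobServer code: the class and method names are hypothetical, and it assumes a POSIX file system, where deleting a file removes only its name while open handles keep the data alive. The first method deletes the file while a stream is still open and shows the content is still readable; the second closes the stream first, which is the order that lets the file system reclaim the blocks.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class for illustration; not part of Flink.
class DeletedFileHandleDemo {

    // Leaky order: delete the file while a stream on it is still open.
    // Returns how many bytes are still readable through the open handle.
    static long readAfterDelete(byte[] payload) {
        try {
            Path tmp = Files.createTempFile("blob-demo-", ".tmp");
            Files.write(tmp, payload);
            try (InputStream in = new FileInputStream(tmp.toFile())) {
                Files.delete(tmp); // removes the name only (assumes POSIX semantics)
                long total = 0;
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    total += n;
                }
                return total;
            } // the stream is closed here; only now can the blocks be reclaimed
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Safe order: the stream is closed before the file is deleted.
    // Returns true if the path no longer exists afterwards.
    static boolean closeThenDelete(byte[] payload) {
        try {
            Path tmp = Files.createTempFile("blob-demo-", ".tmp");
            Files.write(tmp, payload);
            try (InputStream in = new FileInputStream(tmp.toFile())) {
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) {
                    // consume the file
                }
            }
            Files.delete(tmp);
            return Files.notExists(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = new byte[1 << 20]; // 1 MiB stand-in for a job jar
        System.out.println("bytes readable after delete: " + readAfterDelete(payload));
        System.out.println("removed after close: " + closeThenDelete(payload));
    }
}
```

This also accounts for the gap between {{du}} and the full file system: {{du}} walks directory entries, so it cannot see unlinked files. The 30 entries in the {{lsof}} excerpt above, at 66219695 bytes each, add up to 1986590850 bytes (about 1.99 GB) that no longer have a name under /tmp.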



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)