Posted to yarn-issues@hadoop.apache.org by "Jim Brennan (JIRA)" <ji...@apache.org> on 2019/05/02 19:14:00 UTC

[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

    [ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831883#comment-16831883 ] 

Jim Brennan commented on YARN-9527:
-----------------------------------

For example, we recently had a case where all of the disks used by YARN were full:
{noformat}
Filesystem      1K-blocks       Used Available Use% Mounted on
/dev/sdb4      5776759588 5714378904   4561576 100% /grid/1
/dev/sdd2      5840971776 5775661160   6849008 100% /grid/3
/dev/sdc2      5840971776 5777982304   4527864 100% /grid/2
/dev/sda4      5776759588 5712614448   6326032 100% /grid/0
{noformat}
Upon investigation, we found the NodeManager (NM) log full of “Invalid event: LOCALIZED at LOCALIZED” exceptions for a file called creative.data, and we found 2229 copies of that file in the usercache for the user:
{noformat}
-r-x------ 1 user1 users 441478442 Nov 26 15:07 ./1/100009/creative.data
-r-x------ 1 user1 users 441478442 Nov 26 15:07 ./1/100014/creative.data
-r-x------ 1 user1 users 441478442 Nov 26 15:07 ./1/100024/creative.data
-r-x------ 1 user1 users 441478442 Nov 26 15:08 ./1/100189/creative.data
-r-x------ 1 user1 users 441478442 Nov 26 15:08 ./1/100199/creative.data
-r-x------ 1 user1 users 441478442 Nov 26 15:08 ./1/100214/creative.data
-r-x------ 1 user1 users 441478442 Nov 26 15:08 ./1/100229/creative.data
-r-x------ 1 user1 users 441478442 Nov 26 15:08 ./1/100244/creative.data
…
{noformat}
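For reference, at the 441478442 bytes shown in the listing, 2229 copies come to roughly 984 GB. A count like this can be reproduced with a short script. The following is only a minimal sketch, not the commands that were actually run: the function name count_copies and the idea of passing the usercache directory as root are assumptions made for illustration.

```python
import os


def count_copies(root, filename):
    """Return (number of copies, total bytes) of `filename` found under `root`.

    `root` is assumed to be a directory such as a user's usercache filecache;
    the exact path depends on the NodeManager's local-dirs configuration.
    """
    copies = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                # lstat so that a dangling symlink does not raise
                size = os.lstat(path).st_size
            except OSError:
                # The file may be deleted while we are scanning; skip it.
                continue
            copies += 1
            total_bytes += size
    return copies, total_bytes
```

Calling count_copies(root, "creative.data") on the affected directory would return the number of copies and the total space they occupy.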
We had a record of a similar problem reported back in September of 2017.
I scanned our clusters to see how often this was happening. On some clusters, there were a significant number of nodes where this “LOCALIZED at LOCALIZED” exception had occurred. For example, on one cluster there were 122 nodes where I found that log message, and on some of those nodes it appeared a very large number of times:
{noformat}
  12566 node585n18:
  15053 node585n30:
  15819 node262n14:
  36182 node582n24:
  42623 node585n28:
  44447 node586n24:
  47380 node588n03:
 234528 node582n01:
 494196 node221n32:
 688038 node221n01:
1210223 node1442n30:
1306207 node194n06:
1331739 node1442n21:
1366933 node588n37:
1718461 node583n22:
2050377 node588n33:
2252679 node287n05:
{noformat}
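A per-node tally like the one above can be produced by counting matching lines in each NM log. The following is a minimal sketch under stated assumptions, not the scan that was actually used: the mapping of node names to locally readable log paths is hypothetical, and the function names are made up for illustration. Only the message text is taken from the issue.

```python
MESSAGE = "Invalid event: LOCALIZED at LOCALIZED"


def count_matches(lines, message=MESSAGE):
    """Count the lines that contain the exception message."""
    return sum(1 for line in lines if message in line)


def count_per_node(log_paths, message=MESSAGE):
    """Map each node name to its match count.

    `log_paths` is a hypothetical dict of node name -> path to a copy of
    that node's NM log.
    """
    counts = {}
    for node, path in log_paths.items():
        # errors="replace" keeps the scan going past undecodable bytes
        with open(path, errors="replace") as f:
            counts[node] = count_matches(f, message)
    return counts


def report(counts):
    """Format nonzero counts in ascending order, one 'count node:' line each."""
    ordered = sorted(counts.items(), key=lambda item: item[1])
    return [f"{n:>7} {node}:" for node, n in ordered if n > 0]
```

Printing "\n".join(report(count_per_node(log_paths))) gives output in the same shape as the listing, with nodes that never logged the message left out.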

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -------------------------------------------------------------------------
>
>                 Key: YARN-9527
>                 URL: https://issues.apache.org/jira/browse/YARN-9527
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.8.5, 3.1.2
>            Reporter: Jim Brennan
>            Priority: Major
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" exception on each iteration.  Sometimes this continues long enough that it fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
