Posted to yarn-issues@hadoop.apache.org by "Varun Vasudev (JIRA)" <ji...@apache.org> on 2015/11/26 15:30:11 UTC
[jira] [Updated] (YARN-4309) Add debug information to application logs when a container fails

     [ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Vasudev updated YARN-4309:
--------------------------------
    Attachment: YARN-4309.001.patch

Uploaded an initial version of the patch. Collecting the information only for failures is difficult, particularly in secure mode, where it would require changes to container-executor, so the patch collects it for all runs instead. Generation of the additional debug information is optional and defaults to false.

The patch saves a copy of launch_container.sh, the output of ls, and the output of "find -L . -maxdepth 5 -ls".

There's no particular reason for maxdepth 5 - I'm happy to change it if someone feels some other value is more appropriate. The reason for running both find and ls is that ls shows the symlinks themselves, whereas find (with -L) follows each symlink and reports the size of the file it points to.
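
To make the ls/find difference concrete, here is a minimal sketch that can be run outside of YARN. It assumes the ls output is the long format (ls -l), which is what the listing in this comment looks like; the temp directories and the 12-byte file are made up for the demo and are not part of the patch.

```shell
#!/bin/sh
# Hypothetical demo directories; nothing here comes from the patch itself.
demo=$(mktemp -d)      # stands in for the container local dir
target=$(mktemp -d)    # stands in for the filecache dir

# A 12-byte file ("hello world" plus newline) reached through a symlink.
printf 'hello world\n' > "$target/job.xml"
ln -s "$target/job.xml" "$demo/job.xml"

cd "$demo" || exit 1

# ls -l describes the symlink itself: type 'l', the length of the link
# path, and "-> target".
ls_out=$(ls -l)

# find -L follows the symlink, so -ls reports the target's type and size.
find_out=$(find -L . -maxdepth 5 -ls)

printf 'ls:\n%s\nfind:\n%s\n' "$ls_out" "$find_out"

# Clean up the demo directories.
cd / && rm -rf "$demo" "$target"
```

In the ls section job.xml shows up as a link with an arrow to its target, while in the find section it shows up as a regular file with the target's size, which is the reason for capturing both.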

This version of the patch is for Linux only. If someone knows the changes for Windows, I'll add those in.

Just for information, for a mapreduce pi job, this is what was generated for the directory contents:
{code}
ls:
total 32
-rw-r--r-- 1 varun varun  129 Nov 26 19:47 container_tokens
-rwx------ 1 varun varun  702 Nov 26 19:47 default_container_executor_session.sh
-rwx------ 1 varun varun  756 Nov 26 19:47 default_container_executor.sh
lrwxrwxrwx 1 varun varun  113 Nov 26 19:47 job.jar -> /var/hadoop/hadoop-3-data/grid/local/usercache/varun/appcache/application_1448547413698_0001/filecache/10/job.jar
lrwxrwxrwx 1 varun varun  114 Nov 26 19:47 job.xml -> /var/hadoop/hadoop-3-data/grid2/local/usercache/varun/appcache/application_1448547413698_0001/filecache/13/job.xml
-rwx------ 1 varun varun 4941 Nov 26 19:47 launch_container.sh
drwx--x--- 2 varun varun 4096 Nov 26 19:47 tmp
find:
1079692    4 drwx--x---   3 varun    varun        4096 Nov 26 19:47 .
1074586    4 -rw-r--r--   1 varun    varun          16 Nov 26 19:47 ./.default_container_executor.sh.crc
1074581    8 -rwx------   1 varun    varun        4941 Nov 26 19:47 ./launch_container.sh
1049070  104 -r-x------   1 varun    varun      105105 Nov 26 19:47 ./job.xml
1873872    4 drwx------   2 varun    varun        4096 Nov 26 19:47 ./job.jar
1873870  272 -r-x------   1 varun    varun      275886 Nov 26 19:47 ./job.jar/job.jar
1079695    4 drwx--x---   2 varun    varun        4096 Nov 26 19:47 ./tmp
1074582    4 -rw-r--r--   1 varun    varun          48 Nov 26 19:47 ./.launch_container.sh.crc
1074580    4 -rw-r--r--   1 varun    varun          12 Nov 26 19:47 ./.container_tokens.crc
1074585    4 -rwx------   1 varun    varun         756 Nov 26 19:47 ./default_container_executor.sh
1074583    4 -rwx------   1 varun    varun         702 Nov 26 19:47 ./default_container_executor_session.sh
1074579    4 -rw-r--r--   1 varun    varun         129 Nov 26 19:47 ./container_tokens
1074584    4 -rw-r--r--   1 varun    varun          16 Nov 26 19:47 ./.default_container_executor_session.sh.crc
{code}

> Add debug information to application logs when a container fails
> ----------------------------------------------------------------
>
>                 Key: YARN-4309
>                 URL: https://issues.apache.org/jira/browse/YARN-4309
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>         Attachments: YARN-4309.001.patch
>
>
> Sometimes when a container fails, it can be pretty hard to figure out why it failed.
> My proposal is that if a container fails, we collect information about the container local dir and dump it into the container log dir. Ideally, I'd like to tar up the directory entirely, but I'm not sure of the security and space implications of such an approach. At the very least, we can list all the files in the container local dir and dump the contents of launch_container.sh into the container log dir.
> When log aggregation occurs, all this information will automatically get collected and make debugging such failures much easier.


