You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Szilard Nemeth (Jira)" <ji...@apache.org> on 2020/12/19 00:13:00 UTC

[jira] [Commented] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries

    [ https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17252051#comment-17252051 ] 

Szilard Nemeth commented on YARN-10421:
---------------------------------------

Thanks [~bteke] for working on this.

{quote}
1-2-7-9-11-12-15: corrected them.
{quote}
Good, the fixes are okay.

{quote}
3. Other similar methods of this class use the same message so I didn't want to introduce something different.
{quote}
Okay, then it's fine like this.

{quote}
I added some TODO comments. This class is used for testing purposes and the REST testing related tasks are part of YARN-10434.
{quote}
Okay. In general TODOs are not allowed AFAIK, but as you referred to a follow up YARN-10434, it's okay.
Can you add a reference to YARN-10434 in the code as well. Just add it to the TODO message.

{quote}
6. Thanks, I would have missed this document. As the overall implementation is subject to changes I'll create a new Jira for the documentation updates.
{quote}
Thanks.

{quote}
8. This item is to be changed in the tests, and in YARN-10432 it will be replaced by a configuration entry. Should I uppercase it regardless?
{quote}
No, I think in this cas it is fine.

{quote}
10. It is currently a placeholder, in YARN-10422 I'll bundle the script and after that I'll update the location.
{quote}
Okay, got it.

{quote}
13-14. I refactored the tests.
{quote}


Let me restart the numbering here: 
1. I noticed one more thing: 
In RMWebServices#getCommonIssueList and RMWebServices#getCommonIssueData: Both catch blocks catch all the Exceptions, but the newly created WebAppException doesn't have the expception (variable 'e') passed as the 2nd parameter (cause). This is bad, because the cause exception won't be propagated up and won't get printed properly as only the message is printed here.

2. Looking at FederationInterceptorREST once again: 
There are two new methods throwing NotImplementedException with message: "Code is not implemented". Can you add a YARN jira as a reference for the planned implementor of these methods?

3. Nit: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testParseIssueTypeValidCases
For me, this would be more readable if the line variable would be inlined for all usages.

4. Please add assertion messages for each assertion in method: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#assertIssueEquality

 

 

> Create YarnDiagnosticsService to serve diagnostic queries 
> ----------------------------------------------------------
>
>                 Key: YARN-10421
>                 URL: https://issues.apache.org/jira/browse/YARN-10421
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Benjamin Teke
>            Assignee: Benjamin Teke
>            Priority: Major
>         Attachments: YARN-10421.001.patch, YARN-10421.002.patch, YARN-10421.003.patch, YARN-10421.004.patch
>
>
> YarnDiagnosticsServlet should run inside ResourceManager Daemon. The servlet forks a separate process, which executes a shell/Python/etc script. Based on the use-cases listed below the script collects information, bundles it and sends it to UI2. The diagnostic options are the following:
>  # Application hanging: 
>  ** Application logs
>  ** Find the hanging container and get multiple Jstacks
>  ** ResourceManager logs during job lifecycle
>  ** NodeManager logs from NodeManager where the hanging containers of the jobs ran
>  ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez History URL
>  # Application failed: 
>  ** Application logs
>  ** ResourceManager logs during job lifecycle.
>  ** NodeManager logs from NodeManager where the hanging containers of the jobs ran
>  ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez History URL.
>  ** Job related metrics like container, attempts.
>  # Scheduler related issue:
>  ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes.
>  ** Multiple Jstacks of ResourceManager
>  ** YARN and Scheduler Configuration
>  ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API _/ws/v1/cluster/nodes response_
>  ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response (YARN-10319)
>  # ResourceManager / NodeManager daemon fails to start:
>  ** ResourceManager and NodeManager out and log file
>  ** YARN and Scheduler Configuration
> Two new endpoints should be added to the RM web service: one for listing the available diagnostic options (_/common-issue/list_), and one for calling a selected option with the user provided parameters (_/common-issue/collect_).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org