You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Szilard Nemeth (Jira)" <ji...@apache.org> on 2020/12/18 23:50:00 UTC

[jira] [Comment Edited] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries

    [ https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197870#comment-17197870 ] 

Szilard Nemeth edited comment on YARN-10421 at 12/18/20, 11:49 PM:
-------------------------------------------------------------------

Hi [~bteke],

Thanks for working on this.

Here are my comments:

1. In DefaultRequestInterceptorREST: The targetpath parameter has the same value (RMWSConsts.RM_WEB_SERVICE_PATH + RMWSConsts.SCHEDULE) for both getCommonIssueData and getCommonIssueList. Shouldn't you use COMMON_ISSUE_COLLECT in method getCommonIssueData?

2. Nit: There some added newlines for some files at the end, please remove those.

3. Nit: In FederationInterceptorREST, I would use the string for the exception "This feature is not yet implemented" or something like this instead of the current "Code is not implemented"

4. For org.apache.hadoop.yarn.server.router.webapp.MockRESTRequestInterceptor#getCommonIssueData, it doesn't clear why a single OK response is returned. Can you please add a comment to explain this behaviour?

5. Just by looking at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices#getCommonIssueData, it's hard to guess what values the issueId and issueParams fields could take.
 Can you please add javadoc to the method?
 UPDATE: Oh, I can see this is documented in RMWebServiceProtocol.

6. Can you add API documentation of these 2 new endpoints to the file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md?

7. The javadoc for org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServiceProtocol#getCommonIssueData seems wrong. These are not form params, they are query params as per the implementing code, instead.

8. Nit: Can you use uppercase name for org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation?

9. Nit: I would introduce an enum for argument types. You currently have it as a string constant here: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#LIST_ISSUES_ARGUMENT

10. About the script location defined by: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation. 
 What makes the file available at location "/tmp/diagnostics_collector.sh" ?

11. In the warning log in method org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#parseIssueType, please add the expected number of parameters as well to the message. I would also add a return statement after the log message for clarity.

12. Can you please rename the diagnostics_collector.sh, so that it somehow reflects it is intended to be used by tests?

13. There's a typo in comment of method org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped: ambigious -> ambiguous

14. In testcase: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped: You may simplify the testcase dramatically by using a loop to check the issue id, name and parameter.

15. In testcase: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testParseIssueTypeValidCases: The comment says: 
 // valid case: id, name, one parameter
 But the test actually uses 2 parameters so this is a bit misleading.


was (Author: snemeth):
Hi [~bteke],

Thanks for working on this.

Here are my comments:

1. In FederationInterceptorREST: The targetpath parameter has the same value (RMWSConsts.RM_WEB_SERVICE_PATH + RMWSConsts.SCHEDULE) for both getCommonIssueData and getCommonIssueList. Shouldn't you use COMMON_ISSUE_COLLECT in method getCommonIssueData?

2. Nit: There some added newlines for some files at the end, please remove those.

3. Nit: In FederationInterceptorREST, I would use the string for the exception "This feature is not yet implemented" or something like this instead of the current "Code is not implemented"

4. For org.apache.hadoop.yarn.server.router.webapp.MockRESTRequestInterceptor#getCommonIssueData, it doesn't clear why a single OK response is returned. Can you please add a comment to explain this behaviour?

5. Just by looking at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices#getCommonIssueData, it's hard to guess what values the issueId and issueParams fields could take.
Can you please add javadoc to the method?
UPDATE: Oh, I can see this is documented in RMWebServiceProtocol.

6. Can you add API documentation of these 2 new endpoints to the file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md?

7. The javadoc for org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServiceProtocol#getCommonIssueData seems wrong. These are not form params, they are query params as per the implementing code, instead.

8. Nit: Can you use uppercase name for org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation?

9. Nit: I would introduce an enum for argument types. You currently have it as a string constant here: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#LIST_ISSUES_ARGUMENT

10. About the script location defined by: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation. 
What makes the file available at location "/tmp/diagnostics_collector.sh" ?

11. In the warning log in method org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#parseIssueType, please add the expected number of parameters as well to the message. I would also add a return statement after the log message for clarity.

12. Can you please rename the diagnostics_collector.sh, so that it somehow reflects it is intended to be used by tests?

13. There's a typo in comment of method org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped: ambigious -> ambiguous

14. In testcase: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped: You may simplify the testcase dramatically by using a loop to check the issue id, name and parameter.

15. In testcase: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testParseIssueTypeValidCases: The comment says: 
// valid case: id, name, one parameter
But the test actually uses 2 parameters so this is a bit misleading.

> Create YarnDiagnosticsService to serve diagnostic queries 
> ----------------------------------------------------------
>
>                 Key: YARN-10421
>                 URL: https://issues.apache.org/jira/browse/YARN-10421
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Benjamin Teke
>            Assignee: Benjamin Teke
>            Priority: Major
>         Attachments: YARN-10421.001.patch, YARN-10421.002.patch, YARN-10421.003.patch, YARN-10421.004.patch
>
>
> YarnDiagnosticsServlet should run inside ResourceManager Daemon. The servlet forks a separate process, which executes a shell/Python/etc script. Based on the use-cases listed below the script collects information, bundles it and sends it to UI2. The diagnostic options are the following:
>  # Application hanging: 
>  ** Application logs
>  ** Find the hanging container and get multiple Jstacks
>  ** ResourceManager logs during job lifecycle
>  ** NodeManager logs from NodeManager where the hanging containers of the jobs ran
>  ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez History URL
>  # Application failed: 
>  ** Application logs
>  ** ResourceManager logs during job lifecycle.
>  ** NodeManager logs from NodeManager where the hanging containers of the jobs ran
>  ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez History URL.
>  ** Job related metrics like container, attempts.
>  # Scheduler related issue:
>  ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes.
>  ** Multiple Jstacks of ResourceManager
>  ** YARN and Scheduler Configuration
>  ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API _/ws/v1/cluster/nodes response_
>  ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response (YARN-10319)
>  # ResourceManager / NodeManager daemon fails to start:
>  ** ResourceManager and NodeManager out and log file
>  ** YARN and Scheduler Configuration
> Two new endpoints should be added to the RM web service: one for listing the available diagnostic options (_/common-issue/list_), and one for calling a selected option with the user provided parameters (_/common-issue/collect_).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org