You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Szilard Nemeth (Jira)" <ji...@apache.org> on 2020/12/18 23:50:00 UTC
[jira] [Comment Edited] (YARN-10421) Create YarnDiagnosticsService
to serve diagnostic queries
[ https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197870#comment-17197870 ]
Szilard Nemeth edited comment on YARN-10421 at 12/18/20, 11:49 PM:
-------------------------------------------------------------------
Hi [~bteke],
Thanks for working on this.
Here are my comments:
1. In DefaultRequestInterceptorREST: The targetpath parameter has the same value (RMWSConsts.RM_WEB_SERVICE_PATH + RMWSConsts.SCHEDULE) for both getCommonIssueData and getCommonIssueList. Shouldn't you use COMMON_ISSUE_COLLECT in method getCommonIssueData?
2. Nit: There some added newlines for some files at the end, please remove those.
3. Nit: In FederationInterceptorREST, I would use the string for the exception "This feature is not yet implemented" or something like this instead of the current "Code is not implemented"
4. For org.apache.hadoop.yarn.server.router.webapp.MockRESTRequestInterceptor#getCommonIssueData, it doesn't clear why a single OK response is returned. Can you please add a comment to explain this behaviour?
5. Just by looking at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices#getCommonIssueData, it's hard to guess what values the issueId and issueParams fields could take.
Can you please add javadoc to the method?
UPDATE: Oh, I can see this is documented in RMWebServiceProtocol.
6. Can you add API documentation of these 2 new endpoints to the file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md?
7. The javadoc for org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServiceProtocol#getCommonIssueData seems wrong. These are not form params, they are query params as per the implementing code, instead.
8. Nit: Can you use uppercase name for org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation?
9. Nit: I would introduce an enum for argument types. You currently have it as a string constant here: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#LIST_ISSUES_ARGUMENT
10. About the script location defined by: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation.
What makes the file available at location "/tmp/diagnostics_collector.sh" ?
11. In the warning log in method org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#parseIssueType, please add the expected number of parameters as well to the message. I would also add a return statement after the log message for clarity.
12. Can you please rename the diagnostics_collector.sh, so that it somehow reflects it is intended to be used by tests?
13. There's a typo in comment of method org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped: ambigious -> ambiguous
14. In testcase: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped: You may simplify the testcase dramatically by using a loop to check the issue id, name and parameter.
15. In testcase: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testParseIssueTypeValidCases: The comment says:
// valid case: id, name, one parameter
But the test actually uses 2 parameters so this is a bit misleading.
was (Author: snemeth):
Hi [~bteke],
Thanks for working on this.
Here are my comments:
1. In FederationInterceptorREST: The targetpath parameter has the same value (RMWSConsts.RM_WEB_SERVICE_PATH + RMWSConsts.SCHEDULE) for both getCommonIssueData and getCommonIssueList. Shouldn't you use COMMON_ISSUE_COLLECT in method getCommonIssueData?
2. Nit: There some added newlines for some files at the end, please remove those.
3. Nit: In FederationInterceptorREST, I would use the string for the exception "This feature is not yet implemented" or something like this instead of the current "Code is not implemented"
4. For org.apache.hadoop.yarn.server.router.webapp.MockRESTRequestInterceptor#getCommonIssueData, it doesn't clear why a single OK response is returned. Can you please add a comment to explain this behaviour?
5. Just by looking at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices#getCommonIssueData, it's hard to guess what values the issueId and issueParams fields could take.
Can you please add javadoc to the method?
UPDATE: Oh, I can see this is documented in RMWebServiceProtocol.
6. Can you add API documentation of these 2 new endpoints to the file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md?
7. The javadoc for org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServiceProtocol#getCommonIssueData seems wrong. These are not form params, they are query params as per the implementing code, instead.
8. Nit: Can you use uppercase name for org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation?
9. Nit: I would introduce an enum for argument types. You currently have it as a string constant here: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#LIST_ISSUES_ARGUMENT
10. About the script location defined by: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#scriptLocation.
What makes the file available at location "/tmp/diagnostics_collector.sh" ?
11. In the warning log in method org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsService#parseIssueType, please add the expected number of parameters as well to the message. I would also add a return statement after the log message for clarity.
12. Can you please rename the diagnostics_collector.sh, so that it somehow reflects it is intended to be used by tests?
13. There's a typo in comment of method org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped: ambigious -> ambiguous
14. In testcase: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testListCommonIssuesValidCaseWithOptionsToBeSkipped: You may simplify the testcase dramatically by using a loop to check the issue id, name and parameter.
15. In testcase: org.apache.hadoop.yarn.server.resourcemanager.DiagnosticsServiceTest#testParseIssueTypeValidCases: The comment says:
// valid case: id, name, one parameter
But the test actually uses 2 parameters so this is a bit misleading.
> Create YarnDiagnosticsService to serve diagnostic queries
> ----------------------------------------------------------
>
> Key: YARN-10421
> URL: https://issues.apache.org/jira/browse/YARN-10421
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Benjamin Teke
> Assignee: Benjamin Teke
> Priority: Major
> Attachments: YARN-10421.001.patch, YARN-10421.002.patch, YARN-10421.003.patch, YARN-10421.004.patch
>
>
> YarnDiagnosticsServlet should run inside ResourceManager Daemon. The servlet forks a separate process, which executes a shell/Python/etc script. Based on the use-cases listed below the script collects information, bundles it and sends it to UI2. The diagnostic options are the following:
> # Application hanging:
> ** Application logs
> ** Find the hanging container and get multiple Jstacks
> ** ResourceManager logs during job lifecycle
> ** NodeManager logs from NodeManager where the hanging containers of the jobs ran
> ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez History URL
> # Application failed:
> ** Application logs
> ** ResourceManager logs during job lifecycle.
> ** NodeManager logs from NodeManager where the hanging containers of the jobs ran
> ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez History URL.
> ** Job related metrics like container, attempts.
> # Scheduler related issue:
> ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes.
> ** Multiple Jstacks of ResourceManager
> ** YARN and Scheduler Configuration
> ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API _/ws/v1/cluster/nodes response_
> ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response (YARN-10319)
> # ResourceManager / NodeManager daemon fails to start:
> ** ResourceManager and NodeManager out and log file
> ** YARN and Scheduler Configuration
> Two new endpoints should be added to the RM web service: one for listing the available diagnostic options (_/common-issue/list_), and one for calling a selected option with the user provided parameters (_/common-issue/collect_).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org