You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@yunikorn.apache.org by GitBox <gi...@apache.org> on 2022/05/19 02:58:17 UTC
[GitHub] [yunikorn-core] surahman opened a new pull request, #412: [YUNIKORN-1213] Configurable Health Checker Interval
surahman opened a new pull request, #412:
URL: https://github.com/apache/yunikorn-core/pull/412
### What is this PR for?
The changes contained here allow for the configuration of the Health Check interval as well as the ability to disable the health checks outright.
The format for the `ConfigMap` is as follows:
```yaml
scheduler:
heathcheck:
enabled: true
interval: 30s
partitions:
...
```
### What type of PR is it?
* [ ] - Bug Fix
* [ ] - Improvement
* [x] - Feature
* [ ] - Documentation
* [ ] - Hot Fix
* [ ] - Refactoring
### Todos
* [x] - Extend `Config` with a `HealthCheck` subsection under `Scheduler`.
* [x] - Extend `health_checker` to support custom intervals and disabling of the health checks.
* [ ] - Extend `health_checker` with messages to indicate disabled health checks. This is important so that a disabled health check is not confused with a more serious issue.
* [ ] - Documentation updates.
### What is the Jira issue?
* [JIRA issue.](https://issues.apache.org/jira/browse/YUNIKORN-1213)
### How should this be tested?
Tests for all changes have been added to this PR. There is a need for additional deployment testing.
### Questions:
* [ ] - How would we like to handle the functions which return health check information? I can create an `isCheckEnabled` method that takes a `*ClusterContext` and return a `bool` based on the settings in the configs. The issue is that not all the methods take a `*ClusterContext` as a parameter.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [yunikorn-core] yangwwei commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval
Posted by GitBox <gi...@apache.org>.
yangwwei commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1133459102
hi @surahman , @craigcondit can we address the hot-refresh in the next JIRA? In this one, let's focus on getting this configurable. @wilfred-s need your input here about the format of the config file, we need to get to a consensus so @surahman can move forward. thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [yunikorn-core] surahman commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval
Posted by GitBox <gi...@apache.org>.
surahman commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1133185950
### REST runtime `Config` updates
Updating the `Config` object should not represent an issue. [`updateClusterConfig`](https://github.com/apache/yunikorn-core/blob/d81bf72c9fcdf5255816194d0d2ad68d7d3d69a7/pkg/webservice/handlers.go#L434) will generate the parsed updated `Config` into a struct and then reading/updating values is straight forward.
The issue arises with the `HealthCheck` daemon process and updating its setting. I was unable to find another daemon process which allows its configuration updated via the `Config`. The following is my proposed solution after a brief look:
1. Load and read the `enabled` and `interval` directly from the `Config` instead of the local data structure whenever needed. My concern with this is performance. Would it be significantly slower to get and read from `Config` compared to data members in the object?
2. The [`HealthChecker`](https://github.com/apache/yunikorn-core/blob/3160880808aa11384afa457e047335387d0f00f6/pkg/scheduler/health_checker.go#L91-L92) process would need to run and poll the `Config` for updates periodically. `runOnce` would not execute if `enabled=false`.
3. Another `interval` setting for the daemon to poll the `Config` for updates?
4. [`stop`](https://github.com/apache/yunikorn-core/blob/3160880808aa11384afa457e047335387d0f00f6/pkg/scheduler/health_checker.go#L98) will remain as it initially was because the `HealthCheck` daemon process will consistently be running.
5. I am unsure if the primary purpose of disabling the `HealthCheck` was for performance/overhead but only running the daemon would be lightweight.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [yunikorn-core] yangwwei commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval
Posted by GitBox <gi...@apache.org>.
yangwwei commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1131243553
hi @surahman a few high-level feedback:
1. Instead of the following config
```
healthcheck:
enabled: %s
interval: 99s
partitions:
- name: default
queues:
- name: root
```
what was discussed looks like the following:
```
scheduler:
healthcheck:
enabled: true
interval: 30s
partitions:
- name: default
queues:
- name: root
```
We want to include all "global" (non-partition-specific) configs under a single parent "scheduler", including "checksum". I suggest using another JIRA to track the changes to the "checksum", just to make the PR smaller.
2. Please create a JIRA to track the document changes.
3. Please make sure the test case includes both having healthcheck section configured and not having this section.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [yunikorn-core] surahman closed pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval
Posted by GitBox <gi...@apache.org>.
surahman closed pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval
URL: https://github.com/apache/yunikorn-core/pull/412
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [yunikorn-core] craigcondit commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval
Posted by GitBox <gi...@apache.org>.
craigcondit commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1132978350
CC @wilfred-s -- Wilfred, can you give some input on direction for this as well?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [yunikorn-core] surahman commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval
Posted by GitBox <gi...@apache.org>.
surahman commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1132217702
I have examined all the functions for metrics generation in the `HealthCheck` and they are all called locally from the `runOnce` routine and the call tree that stems there. Given this, we can probably leave the routines as-is and add log messages at the `Warn` level to indicate that health metric reporting is disabled.
There are also two methods that are exported and only ever called locally and I wonder if we can switch them to not be exported?
https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L129
https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L146
What are your thoughts @yangwwei and @craigcondit?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [yunikorn-core] codecov[bot] commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval
Posted by GitBox <gi...@apache.org>.
codecov[bot] commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1131958329
# [Codecov](https://codecov.io/gh/apache/yunikorn-core/pull/412?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#412](https://codecov.io/gh/apache/yunikorn-core/pull/412?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (aacdb88) into [master](https://codecov.io/gh/apache/yunikorn-core/commit/3ba91fb8a41c0fd0dd6243326e583dea5167199f?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (3ba91fb) will **increase** coverage by `0.09%`.
> The diff coverage is `96.29%`.
```diff
@@ Coverage Diff @@
## master #412 +/- ##
==========================================
+ Coverage 69.09% 69.18% +0.09%
==========================================
Files 67 67
Lines 9654 9676 +22
==========================================
+ Hits 6670 6694 +24
+ Misses 2740 2738 -2
Partials 244 244
```
| [Impacted Files](https://codecov.io/gh/apache/yunikorn-core/pull/412?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [pkg/common/configs/config.go](https://codecov.io/gh/apache/yunikorn-core/pull/412/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL2NvbW1vbi9jb25maWdzL2NvbmZpZy5nbw==) | `76.66% <ø> (ø)` | |
| [pkg/scheduler/scheduler.go](https://codecov.io/gh/apache/yunikorn-core/pull/412/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9zY2hlZHVsZXIuZ28=) | `0.00% <0.00%> (ø)` | |
| [pkg/scheduler/health\_checker.go](https://codecov.io/gh/apache/yunikorn-core/pull/412/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9oZWFsdGhfY2hlY2tlci5nbw==) | `86.63% <100.00%> (+1.43%)` | :arrow_up: |
| [pkg/scheduler/objects/application.go](https://codecov.io/gh/apache/yunikorn-core/pull/412/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FwcGxpY2F0aW9uLmdv) | `58.77% <0.00%> (-0.15%)` | :arrow_down: |
| [pkg/scheduler/context.go](https://codecov.io/gh/apache/yunikorn-core/pull/412/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9jb250ZXh0Lmdv) | `31.61% <0.00%> (+0.60%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/yunikorn-core/pull/412?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/yunikorn-core/pull/412?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [3ba91fb...aacdb88](https://codecov.io/gh/apache/yunikorn-core/pull/412?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [yunikorn-core] surahman commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval
Posted by GitBox <gi...@apache.org>.
surahman commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1131965103
> 1. Instead of the following config
>
> ```
> healthcheck:
> enabled: %s
> interval: 99s
> partitions:
> - name: default
> queues:
> - name: root
> ```
>
> what was discussed looks like the following:
>
> ```
> scheduler:
> healthcheck:
> enabled: true
> interval: 30s
> partitions:
> - name: default
> queues:
> - name: root
> ```
>
> We want to include all "global" (non-partition-specific) configs under a single parent "scheduler", including "checksum". I suggest using another JIRA to track the changes to the "checksum", just to make the PR smaller.
Thank you for pointing this out, it was a misunderstanding on my part 🤦🏼♂️.
I have created a [YUNIKORN-1220](https://issues.apache.org/jira/browse/YUNIKORN-1220) for the `checksum`.
> 2. Please create a JIRA to track the document changes.
[YUNIKORN-1219](https://issues.apache.org/jira/browse/YUNIKORN-1219)
> 3. Please make sure the test case includes both having healthcheck section configured and not having this section.
I did not include a test case for this scenario because this test should cover the default condition:
https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker_test.go#L64-L76
I can see the need for an explicit test to catch the case though and have added one.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [yunikorn-core] craigcondit commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval
Posted by GitBox <gi...@apache.org>.
craigcondit commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1132971411
> I have examined all the functions for metrics generation in the `HealthCheck` and they are all called locally from the `runOnce` routine and the call tree that stems there. Given this, we can probably leave the routines as-is and add log messages at the `Warn` level to indicate that health metric reporting is disabled.
Let's not log anything if the reporting is disabled -- if it is, that's a conscious decision on the admin's part so let's not punish him by spamming the logs.
> There are also two methods that are exported and only ever called locally and I wonder if we can switch them to not be exported?
>
> https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L129
>
>
> https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L146
Making these private should be fine.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org