You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@yunikorn.apache.org by GitBox <gi...@apache.org> on 2022/05/19 02:58:17 UTC

[GitHub] [yunikorn-core] surahman opened a new pull request, #412: [YUNIKORN-1213] Configurable Health Checker Interval

surahman opened a new pull request, #412:
URL: https://github.com/apache/yunikorn-core/pull/412

   ### What is this PR for?
   The changes contained here allow for the configuration of the Health Check interval as well as the ability to disable the health checks outright.
   
   The format for the `ConfigMap` is as follows:
   ```yaml
   scheduler:
     heathcheck:
       enabled: true
       interval: 30s
     partitions:
   ...
   ```
   
   
   ### What type of PR is it?
   * [ ] - Bug Fix
   * [ ] - Improvement
   * [x] - Feature
   * [ ] - Documentation
   * [ ] - Hot Fix
   * [ ] - Refactoring
   
   ### Todos
   * [x] - Extend `Config` with a `HealthCheck` subsection under `Scheduler`.
   * [x] - Extend `health_checker` to support custom intervals and disabling of the health checks.
   * [ ] - Extend `health_checker` with messages to indicate disabled health checks. This is important so that a disabled health check is not confused with a more serious issue.
   * [ ] - Documentation updates.
   
   ### What is the Jira issue?
   * [JIRA issue.](https://issues.apache.org/jira/browse/YUNIKORN-1213)
   
   ### How should this be tested?
   Tests for all changes have been added to this PR. There is a need for additional deployment testing.
   
   ### Questions:
   * [ ] - How would we like to handle the functions which return health check information? I can create an `isCheckEnabled` method that takes a `*ClusterContext` and return a `bool` based on the settings in the configs. The issue is that not all the methods take a `*ClusterContext` as a parameter.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] yangwwei commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval

Posted by GitBox <gi...@apache.org>.
yangwwei commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1133459102

   hi @surahman , @craigcondit can we address the hot-refresh in the next JIRA? In this one, let's focus on getting this configurable.  @wilfred-s need your input here about the format of the config file, we need to get to a consensus so @surahman can move forward. thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] surahman commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval

Posted by GitBox <gi...@apache.org>.
surahman commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1133185950

   ### REST runtime `Config` updates
   
   Updating the `Config` object should not represent an issue. [`updateClusterConfig`](https://github.com/apache/yunikorn-core/blob/d81bf72c9fcdf5255816194d0d2ad68d7d3d69a7/pkg/webservice/handlers.go#L434) will generate the parsed updated `Config` into a struct and then reading/updating values is straight forward.
   
   The issue arises with the `HealthCheck` daemon process and updating its setting. I was unable to find another daemon process which allows its configuration updated via the `Config`. The following is my proposed solution after a brief look:
   
   1. Load and read the `enabled` and `interval` directly from the `Config` instead of the local data structure whenever needed. My concern with this is performance. Would it be significantly slower to get and read from `Config` compared to data members in the object?
   2. The [`HealthChecker`](https://github.com/apache/yunikorn-core/blob/3160880808aa11384afa457e047335387d0f00f6/pkg/scheduler/health_checker.go#L91-L92) process would need to run and poll the `Config` for updates periodically. `runOnce` would not execute if `enabled=false`.
   3. Another `interval` setting for the daemon to poll the `Config` for updates?
   4. [`stop`](https://github.com/apache/yunikorn-core/blob/3160880808aa11384afa457e047335387d0f00f6/pkg/scheduler/health_checker.go#L98) will remain as it initially was because the `HealthCheck` daemon process will consistently be running.
   5. I am unsure if the primary purpose of disabling the `HealthCheck` was for performance/overhead but only running the daemon would be lightweight.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] yangwwei commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval

Posted by GitBox <gi...@apache.org>.
yangwwei commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1131243553

   hi @surahman  a few high-level feedback:
   
   1. Instead of the following config
   
   ```
   healthcheck:
     enabled: %s
     interval: 99s
   partitions:
     - name: default
       queues:
         - name: root
   ```
   
   what was discussed looks like the following:
   
   ```
   scheduler:
     healthcheck:
       enabled: true
       interval: 30s
   partitions:
     - name: default
        queues:
           - name: root
   ```
   
   We want to include all "global" (non-partition-specific) configs under a single parent "scheduler", including "checksum".  I suggest using another JIRA to track the changes to the "checksum", just to make the PR smaller.
   
   2. Please create a JIRA to track the document changes.
   3. Please make sure the test case includes both having healthcheck section configured and not having this section.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] surahman closed pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval

Posted by GitBox <gi...@apache.org>.
surahman closed pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval
URL: https://github.com/apache/yunikorn-core/pull/412


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] craigcondit commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval

Posted by GitBox <gi...@apache.org>.
craigcondit commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1132978350

   CC @wilfred-s -- Wilfred, can you give some input on direction for this as well?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] surahman commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval

Posted by GitBox <gi...@apache.org>.
surahman commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1132217702

   I have examined all the functions for metrics generation in the `HealthCheck` and they are all called locally from the `runOnce` routine and the call tree that stems there. Given this, we can probably leave the routines as-is and add log messages at the `Warn` level to indicate that health metric reporting is disabled.
   
   There are also two methods that are exported and only ever called locally and I wonder if we can switch them to not be exported?
   https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L129
   https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L146
   
   What are your thoughts @yangwwei and @craigcondit?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] codecov[bot] commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval

Posted by GitBox <gi...@apache.org>.
codecov[bot] commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1131958329

   # [Codecov](https://codecov.io/gh/apache/yunikorn-core/pull/412?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#412](https://codecov.io/gh/apache/yunikorn-core/pull/412?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (aacdb88) into [master](https://codecov.io/gh/apache/yunikorn-core/commit/3ba91fb8a41c0fd0dd6243326e583dea5167199f?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (3ba91fb) will **increase** coverage by `0.09%`.
   > The diff coverage is `96.29%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #412      +/-   ##
   ==========================================
   + Coverage   69.09%   69.18%   +0.09%     
   ==========================================
     Files          67       67              
     Lines        9654     9676      +22     
   ==========================================
   + Hits         6670     6694      +24     
   + Misses       2740     2738       -2     
     Partials      244      244              
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/yunikorn-core/pull/412?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [pkg/common/configs/config.go](https://codecov.io/gh/apache/yunikorn-core/pull/412/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL2NvbW1vbi9jb25maWdzL2NvbmZpZy5nbw==) | `76.66% <ø> (ø)` | |
   | [pkg/scheduler/scheduler.go](https://codecov.io/gh/apache/yunikorn-core/pull/412/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9zY2hlZHVsZXIuZ28=) | `0.00% <0.00%> (ø)` | |
   | [pkg/scheduler/health\_checker.go](https://codecov.io/gh/apache/yunikorn-core/pull/412/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9oZWFsdGhfY2hlY2tlci5nbw==) | `86.63% <100.00%> (+1.43%)` | :arrow_up: |
   | [pkg/scheduler/objects/application.go](https://codecov.io/gh/apache/yunikorn-core/pull/412/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9vYmplY3RzL2FwcGxpY2F0aW9uLmdv) | `58.77% <0.00%> (-0.15%)` | :arrow_down: |
   | [pkg/scheduler/context.go](https://codecov.io/gh/apache/yunikorn-core/pull/412/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGtnL3NjaGVkdWxlci9jb250ZXh0Lmdv) | `31.61% <0.00%> (+0.60%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/yunikorn-core/pull/412?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/yunikorn-core/pull/412?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [3ba91fb...aacdb88](https://codecov.io/gh/apache/yunikorn-core/pull/412?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] surahman commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval

Posted by GitBox <gi...@apache.org>.
surahman commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1131965103

   > 1. Instead of the following config
   > 
   > ```
   > healthcheck:
   >   enabled: %s
   >   interval: 99s
   > partitions:
   >   - name: default
   >     queues:
   >       - name: root
   > ```
   > 
   > what was discussed looks like the following:
   > 
   > ```
   > scheduler:
   >   healthcheck:
   >     enabled: true
   >     interval: 30s
   > partitions:
   >   - name: default
   >      queues:
   >         - name: root
   > ```
   > 
   > We want to include all "global" (non-partition-specific) configs under a single parent "scheduler", including "checksum". I suggest using another JIRA to track the changes to the "checksum", just to make the PR smaller.
   
   Thank you for pointing this out, it was a misunderstanding on my part 🤦🏼‍♂️.
   I have created a [YUNIKORN-1220](https://issues.apache.org/jira/browse/YUNIKORN-1220) for the `checksum`.
   
   > 2. Please create a JIRA to track the document changes.
   
   [YUNIKORN-1219](https://issues.apache.org/jira/browse/YUNIKORN-1219)
   
   > 3. Please make sure the test case includes both having healthcheck section configured and not having this section.
   
   I did not include a test case for this scenario because this test should cover the default condition:
   https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker_test.go#L64-L76
   
    I can see the need for an explicit test to catch the case though and have added one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [yunikorn-core] craigcondit commented on pull request #412: [YUNIKORN-1213] Configurable Health Checker Interval

Posted by GitBox <gi...@apache.org>.
craigcondit commented on PR #412:
URL: https://github.com/apache/yunikorn-core/pull/412#issuecomment-1132971411

   > I have examined all the functions for metrics generation in the `HealthCheck` and they are all called locally from the `runOnce` routine and the call tree that stems there. Given this, we can probably leave the routines as-is and add log messages at the `Warn` level to indicate that health metric reporting is disabled.
   
   Let's not log anything if the reporting is disabled -- if it is, that's a conscious decision on the admin's part so let's not punish him by spamming the logs.
    
   > There are also two methods that are exported and only ever called locally and I wonder if we can switch them to not be exported?
   > 
   > https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L129
   > 
   > 
   > https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L146
   
   Making these private should be fine.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org