You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@yunikorn.apache.org by "Ryan Lo (Jira)" <ji...@apache.org> on 2022/04/20 14:38:00 UTC

[jira] [Assigned] (YUNIKORN-1179) Logs are spammed with health check status messages

     [ https://issues.apache.org/jira/browse/YUNIKORN-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Lo reassigned YUNIKORN-1179:
---------------------------------

    Assignee: Ryan Lo

> Logs are spammed with health check status messages
> --------------------------------------------------
>
>                 Key: YUNIKORN-1179
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1179
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Peter Bacsko
>            Assignee: Ryan Lo
>            Priority: Major
>
> YUNIKORN-1107 introduced periodic background health check.
> The problem is, too much noise is printed to the console:
> {noformat}
> 2022-04-20T13:28:03.101Z	INFO	scheduler/health_checker.go:87	Scheduler is healthy	{"health check values": [{"Name":"Scheduling errors","Succeeded":true,"Description":"Check for scheduling error entries in metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed nodes logged in the metrics"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the partitions","DiagnosisMessage":"Partitions with negative resources: []"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the nodes","DiagnosisMessage":"Nodes with negative resources: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if a node's allocated resource <= total resource of the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if total partition resource == sum of the node resources from the partition","DiagnosisMessage":"Partitions with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node total resource = allocated resource + occupied resource + available resource","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Reservation check","Succeeded":true,"Description":"Check the reservation nr compared to the number of nodes","DiagnosisMessage":"Reservation/node nr ratio: [0.000000]"},{"Name":"Orphan allocation on node check","Succeeded":true,"Description":"Check if there are orphan allocations on the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan allocation on app check","Succeeded":true,"Description":"Check if there are orphan allocations on the applications","DiagnosisMessage":"OrphanAllocations: []"}]}
> 2022-04-20T13:28:33.098Z	INFO	scheduler/health_checker.go:87	Scheduler is healthy	{"health check values": [{"Name":"Scheduling errors","Succeeded":true,"Description":"Check for scheduling error entries in metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed nodes logged in the metrics"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the partitions","DiagnosisMessage":"Partitions with negative resources: []"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the nodes","DiagnosisMessage":"Nodes with negative resources: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if a node's allocated resource <= total resource of the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if total partition resource == sum of the node resources from the partition","DiagnosisMessage":"Partitions with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node total resource = allocated resource + occupied resource + available resource","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Reservation check","Succeeded":true,"Description":"Check the reservation nr compared to the number of nodes","DiagnosisMessage":"Reservation/node nr ratio: [0.000000]"},{"Name":"Orphan allocation on node check","Succeeded":true,"Description":"Check if there are orphan allocations on the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan allocation on app check","Succeeded":true,"Description":"Check if there are orphan allocations on the applications","DiagnosisMessage":"OrphanAllocations: []"}]}
> {noformat}
> I don't think we need that much output in every 30 seconds. In fact, if the scheduler is healthy, we don't need anything at all, maybe a short printout on DEBUG level, but nothing more.
> If the health check failed, then we might log it, but even in that case this looks unnecessary.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@yunikorn.apache.org
For additional commands, e-mail: issues-help@yunikorn.apache.org