Posted to dev@yunikorn.apache.org by Chaoran Yu <yu...@gmail.com> on 2021/04/14 20:04:03 UTC

YuniKorn Metrics

Hello Tao,

During our discussion with Wilfred yesterday, he mentioned that you folks
at Alibaba have been running YuniKorn at a decent scale. We are also
running some big workloads (Spark batch jobs) with YuniKorn and would like
better visibility into scheduling performance, as well as alerts to help
us spot issues as soon as they happen. We found that the current list of
metrics available in the core is not comprehensive, and some seem to be
computed incorrectly. So we are reaching out to kindly ask: which metrics
have you found most helpful? Did you add any new metrics? A more general
question: how have you been monitoring YuniKorn? Many thanks in advance.

If anyone else on the mailing list has ideas to chime in, that would be
awesome too.

Regards,
Chaoran
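
For anyone following this thread: the core scheduler exposes its metrics
in Prometheus text format, so a quick way to see which metrics exist today
is to pull the metrics endpoint and list the YuniKorn metric families. A
minimal Go sketch follows; the URL and port are placeholders for whatever
your deployment exposes, not a confirmed default, and the same endpoint is
what a Prometheus scrape job and alert rules would consume.

package main

import (
    "fmt"
    "log"
    "net/http"
    "strings"

    "github.com/prometheus/common/expfmt"
)

func main() {
    // Placeholder endpoint: point this at wherever your deployment
    // exposes the scheduler's Prometheus metrics.
    const metricsURL = "http://localhost:9080/ws/v1/metrics"

    resp, err := http.Get(metricsURL)
    if err != nil {
        log.Fatalf("fetching metrics: %v", err)
    }
    defer resp.Body.Close()

    // Parse the Prometheus text exposition format into metric families.
    var parser expfmt.TextParser
    families, err := parser.TextToMetricFamilies(resp.Body)
    if err != nil {
        log.Fatalf("parsing metrics: %v", err)
    }

    // Print only the scheduler-related families so it is easy to see
    // what is available for dashboards and alert rules.
    for name, mf := range families {
        if strings.Contains(name, "yunikorn") {
            fmt.Printf("%s (%s): %d series\n", name, mf.GetType(), len(mf.GetMetric()))
        }
    }
}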

Re: YuniKorn Metrics

Posted by Chaoran Yu <yu...@gmail.com>.
Thanks, Tao, for the details; that's very helpful. Really appreciate it!
We'll look into the options you mentioned.


Re: YuniKorn Metrics

Posted by Tao Yang <ta...@apache.org>.
Hi Chaoran,

Sorry for the late response. Yes, we did some performance tests and found
that the scheduling process was far from transparent at the beginning;
just as you said, the internal metrics are not good enough for us to spot
issues or locate bottlenecks. So we have explored several approaches to
improve the visibility of the scheduling process, as follows:
1) Broaden the horizon: scheduling is only one phase of the pod lifecycle,
so we want to see every phase and know exactly where the biggest
bottleneck is. By parsing all the key timestamps (e.g.
create/scheduled/started/initialized/ready/containers-ready) out of every
Pod, aggregating the data, and showing it in Grafana charts, we indeed
found much bigger bottlenecks elsewhere: in the APIServer, some CNI/CSI
services, or the Kubelet. This helps a lot to quickly locate bottlenecks
across the whole pod lifecycle (a sketch of the timestamp extraction
follows this list).
2) Dig into the details: use an existing tracing framework (e.g.
OpenTracing) to collect tracing information in a standardized format for
scheduling and resource management. The traces follow the time and space
sequence of the scheduling process and can be collected periodically or
on demand to help spot issues. Please refer to YUNIKORN-387 for details;
Weihao Zheng will keep working on this feature (a second sketch below
shows the span idea).
3) We also developed a simple profiling tool that can easily be injected
anywhere and produce a statistics report periodically or on demand, so
that we can clearly see the performance details of any process (a rough
sketch of such a tool is the last example below).
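
Point 1 describes extracting pod lifecycle timestamps. A minimal sketch of
one way to do that with client-go is below; the "spark" namespace and the
kubeconfig handling are made-up examples, and a real setup would push the
resulting durations into Prometheus/Grafana rather than print them.

package main

import (
    "context"
    "fmt"
    "log"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

// conditionTime returns the transition time of the given pod condition,
// if the condition is present on the pod.
func conditionTime(pod *corev1.Pod, condType corev1.PodConditionType) *metav1.Time {
    for i := range pod.Status.Conditions {
        if pod.Status.Conditions[i].Type == condType {
            t := pod.Status.Conditions[i].LastTransitionTime
            return &t
        }
    }
    return nil
}

func main() {
    // Assumes a kubeconfig at the default location; adjust for in-cluster use.
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    client, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatal(err)
    }

    // "spark" is a hypothetical namespace used for illustration.
    pods, err := client.CoreV1().Pods("spark").List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        log.Fatal(err)
    }

    for _, pod := range pods.Items {
        created := pod.CreationTimestamp
        scheduled := conditionTime(&pod, corev1.PodScheduled)
        ready := conditionTime(&pod, corev1.PodReady)
        if scheduled == nil || ready == nil {
            continue // pod has not progressed far enough through its lifecycle
        }
        fmt.Printf("%s: create->scheduled %v, scheduled->ready %v\n",
            pod.Name,
            scheduled.Sub(created.Time),
            ready.Sub(scheduled.Time))
    }
}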
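
Point 2 refers to an existing tracing framework such as OpenTracing. The
sketch below shows what wrapping a scheduling phase in spans could look
like with the opentracing-go API; the span names and the "sort-queues"
phase are purely illustrative, and YUNIKORN-387 tracks the actual design.

package main

import (
    "fmt"
    "time"

    "github.com/opentracing/opentracing-go"
)

// scheduleOnce wraps one (hypothetical) scheduling cycle in a span so that
// the time/space sequence of the cycle shows up in whatever tracer is set.
func scheduleOnce(appID string) {
    span := opentracing.GlobalTracer().StartSpan("schedule-cycle")
    defer span.Finish()
    span.SetTag("application", appID)

    // Child span for one internal phase of the cycle, e.g. sorting queues.
    child := opentracing.GlobalTracer().StartSpan(
        "sort-queues", opentracing.ChildOf(span.Context()))
    time.Sleep(2 * time.Millisecond) // stand-in for real work
    child.Finish()
}

func main() {
    // With no tracer registered, GlobalTracer() is a no-op implementation,
    // so this sketch runs as-is; a real deployment would register e.g. a
    // Jaeger tracer via opentracing.SetGlobalTracer(...).
    scheduleOnce("application-0001")
    fmt.Println("done")
}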
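
Point 3's tool is only described at a high level in this thread, so the
following is just a guess at its general shape: a small stopwatch helper
that can be dropped into any code path and asked for a summary on demand.

package main

import (
    "fmt"
    "sync"
    "time"
)

// profiler collects per-label durations and can dump a summary on demand.
type profiler struct {
    mu    sync.Mutex
    count map[string]int
    total map[string]time.Duration
}

func newProfiler() *profiler {
    return &profiler{count: map[string]int{}, total: map[string]time.Duration{}}
}

// Track wraps any code path: defer p.Track("label")() at the top of a block.
func (p *profiler) Track(label string) func() {
    start := time.Now()
    return func() {
        d := time.Since(start)
        p.mu.Lock()
        p.count[label]++
        p.total[label] += d
        p.mu.Unlock()
    }
}

// Report prints the call count and average latency per label.
func (p *profiler) Report() {
    p.mu.Lock()
    defer p.mu.Unlock()
    for label, n := range p.count {
        fmt.Printf("%-20s count=%d avg=%v\n", label, n, p.total[label]/time.Duration(n))
    }
}

func main() {
    p := newProfiler()
    for i := 0; i < 100; i++ {
        func() {
            defer p.Track("allocate")()
            time.Sleep(time.Millisecond) // stand-in for a scheduling step
        }()
    }
    p.Report()
}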

Hope this helps. Thanks.

Regards,
Tao
