Posted to dev@yunikorn.apache.org by Bowen Li <bo...@gmail.com> on 2022/01/06 00:14:41 UTC

Re: Observability of actual cpu/memory usage

Thanks all for your input. To clarify, the goal of this conversation is to
reach consensus and a common understanding on the motivation and business
needs first, though the discussion seems to be diverging into implementation
details.

Let's take another look at the business need first. The use case is for data
engineers/scientists to find room for optimization when cpu/memory is over-
or under-allocated, so they need a way to look at container cpu/memory usage,
both aggregated across all containers in a Spark job and drilled down to a
single container that is part of a job. What's missing is the association.
Say a Spark job has 2 executor pods (a, b): users basically want to 1) see
aggregated cpu/memory usage, i.e. sum(runtime cpu(a, b)) vs. sum(requested
cpu(a, b)), and 2) identify that the executor pods are a+b, not a+c nor d+e,
and quickly navigate to runtime-cpu(a) vs. requested-cpu(a).
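
To make the association concrete, here is a rough sketch in Go of the roll-up
we want to be able to do (the type and field names are hypothetical, not an
existing YuniKorn or k8s API):

package main

import "fmt"

// PodSample is a hypothetical per-pod record: what the pod requested at
// scheduling time vs. what it actually used at runtime.
type PodSample struct {
    AppID        string  // the Spark job the pod belongs to (the missing association)
    Pod          string  // "a", "b", ...
    RequestedCPU float64 // cores requested
    UsedCPU      float64 // cores actually consumed at runtime
}

// rollUp produces the per-job view: sum(requested cpu) vs. sum(runtime cpu).
func rollUp(samples []PodSample) (requested, used map[string]float64) {
    requested, used = map[string]float64{}, map[string]float64{}
    for _, s := range samples {
        requested[s.AppID] += s.RequestedCPU
        used[s.AppID] += s.UsedCPU
    }
    return requested, used
}

func main() {
    req, used := rollUp([]PodSample{
        {AppID: "spark-job-1", Pod: "a", RequestedCPU: 4, UsedCPU: 1.5},
        {AppID: "spark-job-1", Pod: "b", RequestedCPU: 4, UsedCPU: 2.0},
    })
    fmt.Printf("spark-job-1: requested %.1f cores, used %.1f cores\n",
        req["spark-job-1"], used["spark-job-1"])
}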

We are paying closed-source vendors big bills just to do this single thing,
and it would be much better to have an open source solution; YK is in the
best position to connect scheduling info with metrics. This is a very common
requirement once the scale of a workload goes beyond hundreds of cpus, not
just for us, and we'll see it come up more as YK adoption grows.

There seem to be different ideas about the implementation, e.g. how to do it,
whether it should be a pluggable model, and where it should live, but I hope
to keep those for a later discussion.

Can you share your thoughts on the motivation? If it looks good, and the
YIP proposal (see another thread I just sent) passes, we can start a formal
design discussion as YIP-1.

Thanks,
Bowen


On Wed, Dec 22, 2021 at 7:48 PM Wilfred Spiegelenburg <wi...@apache.org>
wrote:

> We should be careful adding functionality to the scheduler that is not part
> of the scheduling cycle. Monitoring the real usage of a pod is not part of
> scheduling; it belongs to the metrics of the node the pod runs on. YuniKorn
> is a scheduler, it does not have a presence on the nodes, and we should not
> create such a presence from this project. We have to rely on what the
> current system can provide.
>
> The metrics server readme [1] clearly states that it should *not* be used
> as a source for monitoring solutions. Instead, monitoring solutions should
> use the kubelet's /metrics/resource or /metrics/cadvisor endpoints. That
> would mean polling each node to get the metric details. That kind of
> monitoring is outside a scheduler's core tasks, and monitoring nodes places
> a different set of requirements on the scheduler (networking etc.).
> Monitoring solutions like Prometheus [2] already provide this functionality
> out of the box; adding it to YuniKorn is not the correct solution.
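>
> Just to illustrate what that polling would involve, a minimal sketch in Go
> (not something YuniKorn should ship; the port, the bearer token handling and
> the TLS settings are simplified assumptions):
>
> package main
>
> import (
>     "crypto/tls"
>     "fmt"
>     "io"
>     "net/http"
>     "os"
> )
>
> func main() {
>     // one request like this per node, against the kubelet's resource
>     // metrics endpoint
>     node := "worker-1.example.com" // hypothetical node address
>     req, _ := http.NewRequest("GET",
>         "https://"+node+":10250/metrics/resource", nil)
>     req.Header.Set("Authorization", "Bearer "+os.Getenv("KUBELET_TOKEN"))
>
>     client := &http.Client{Transport: &http.Transport{
>         // skipping certificate verification only to keep the sketch short
>         TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
>     }}
>     resp, err := client.Do(req)
>     if err != nil {
>         panic(err)
>     }
>     defer resp.Body.Close()
>     body, _ := io.ReadAll(resp.Body)
>     fmt.Println(string(body)) // Prometheus text format, per-container usage
> }
>
> Doing this for every node, securely and continuously, is exactly the job a
> monitoring stack already does.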
>
> I completely agree that we need to provide as many details and metrics
> around the scheduling as we can. Queues, Applications and Nodes should all
> expose metrics from a scheduling point of view. We should provide enough
> detail in the metrics to allow analysis of an application's life cycle.
>
> Wilfred
>
> [1]
> https://github.com/kubernetes-sigs/metrics-server#kubernetes-metrics-server
> [2]
>
> https://github.com/prometheus/prometheus/blob/10e72596b95db8fa0fe5f7472691930a3393cf45/documentation/examples/prometheus-kubernetes.yml#L96
>
> On Wed, 22 Dec 2021 at 11:54, Chenya Zhang <ch...@gmail.com>
> wrote:
>
> > From metrics server's documentation,
> >
> > Don't use Metrics Server when you need:
> > - Non-Kubernetes clusters
> > - An accurate source of resource usage metrics
> > - Horizontal autoscaling based on other resources than CPU/Memory
> >
> > I think they have some concerns about metrics accuracy. We may need to
> > understand what the possible risks are here.
> >
> > For example, if a user is trying to tune an application but gets
> > conflicting information in different runs, it could be confusing for them.
> > If there is a good range of consistency or any potential areas of
> > inaccuracy that can be documented, it would be a helpful source of
> > information for application tuning.
> >
> >
> > On Tue, Dec 21, 2021 at 3:19 PM Weiwei Yang <ww...@apache.org> wrote:
> >
> > > K8s dashboard did some integration with metrics-server, maybe we can
> > > investigate and see how that was done.
> > > Essentially we just need to pull these metrics somewhere.
> > >
> > > On Tue, Dec 21, 2021 at 2:42 PM Chaoran Yu <yu...@gmail.com>
> > > wrote:
> > >
> > > > Previously when doing research on this topic, I saw that the
> > > > metrics-server documentation says: "*Metrics Server is not meant for
> > > > non-autoscaling purposes. For example, don't use it to forward metrics
> > > > to monitoring solutions, or as a source of monitoring solution
> > > > metrics. In such cases please collect metrics from Kubelet
> > > > /metrics/resource endpoint directly*." But the Kubelet APIs
> > > > <https://github.com/kubernetes/kubernetes/blob/v1.21.5/pkg/kubelet/server/server.go#L236>
> > > > that the statement refers to are not documented, meaning they are
> > > > hidden APIs that can change or be deprecated at any future Kubernetes
> > > > release. Integrating with these APIs doesn't sound promising. But
> > > > besides Kubelet, the actual utilization info of workloads is not
> > > > readily available anywhere else. We'll need to explore other ideas.
> > > >
> > > > On Tue, Dec 21, 2021 at 12:51 PM Weiwei Yang <ww...@apache.org>
> > > > wrote:
> > > >
> > > > > Thank you Bowen for raising this, it is an interesting topic. Bear
> > > > > with me on this long reply : )
> > > > >
> > > > > Like Wilfred mentioned, YK doesn't know the actual used resources in
> > > > > terms of CPU and memory for each pod or application, at least not
> > > > > today. I understand the requirement to track this info in order to
> > > > > give users some feedback, or even recommendations, on how to tune
> > > > > their jobs more properly. It would be good to have something in our
> > > > > view like "Allocated" vs "Used" for each app/queue. We could further
> > > > > introduce some penalties if people keep over-requesting resources.
> > > > >
> > > > > However, most likely we will need to do this outside of YK. The major
> > > > > reason is that all the data YK consumes comes from the api-server,
> > > > > backed by etcd, and none of these metrics are stored in etcd, as per
> > > > > the design of metrics-server
> > > > > <https://github.com/kubernetes-sigs/metrics-server>. Second, YK
> > > > > doesn't have any per-node agent that we could use to collect actual
> > > > > resource usage; we would still need to leverage a 3rd party tool to
> > > > > do so. Maybe we can do some integration with metrics-server,
> > > > > aggregating app/queue usage info from those fragmented metrics, and
> > > > > then plug that into our yunikorn-web UI. I believe we have the
> > > > > flexibility to do this, which could be an option.
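> > > > >
> > > > > As a very rough sketch of that idea (assuming an external component
> > > > > running in the cluster; the namespace and the spark-app-selector
> > > > > label value below are made up for the example), pulling per-pod usage
> > > > > from metrics-server via the metrics.k8s.io API could look like:
> > > > >
> > > > > package main
> > > > >
> > > > > import (
> > > > >     "context"
> > > > >     "fmt"
> > > > >
> > > > >     metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
> > > > >     "k8s.io/client-go/rest"
> > > > >     metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
> > > > > )
> > > > >
> > > > > func main() {
> > > > >     cfg, err := rest.InClusterConfig() // assumes in-cluster credentials
> > > > >     if err != nil {
> > > > >         panic(err)
> > > > >     }
> > > > >     mc, err := metricsclient.NewForConfig(cfg)
> > > > >     if err != nil {
> > > > >         panic(err)
> > > > >     }
> > > > >     // Spark labels driver/executor pods with spark-app-selector, so
> > > > >     // "used" resources can be aggregated per application and compared
> > > > >     // to what was allocated.
> > > > >     pods, err := mc.MetricsV1beta1().PodMetricses("spark-jobs").List(
> > > > >         context.TODO(),
> > > > >         metav1.ListOptions{LabelSelector: "spark-app-selector=spark-123"})
> > > > >     if err != nil {
> > > > >         panic(err)
> > > > >     }
> > > > >     var usedMilliCPU, usedMemBytes int64
> > > > >     for _, pm := range pods.Items {
> > > > >         for _, c := range pm.Containers {
> > > > >             usedMilliCPU += c.Usage.Cpu().MilliValue()
> > > > >             usedMemBytes += c.Usage.Memory().Value()
> > > > >         }
> > > > >     }
> > > > >     fmt.Printf("spark-123 used: %dm CPU, %d bytes memory\n",
> > > > >         usedMilliCPU, usedMemBytes)
> > > > > }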
> > > > >
> > > > > On Mon, Dec 20, 2021 at 10:28 PM Wilfred Spiegelenburg <
> > > > > wilfreds@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Hi Bowen,
> > > > > >
> > > > > > Maybe a strange question, but what do you consider "actually used"
> > > > > > resources? Anything the scheduler sees is used. The scheduler has
> > > > > > no information on what the container really occupies: it asked for
> > > > > > 100GB but really occupies only 50GB, etc. If you need that, YuniKorn
> > > > > > cannot help you. If it is just about looking at allocation over
> > > > > > time, YuniKorn is capable of giving you the information.
> > > > > >
> > > > > > The second point to make is that normally applications do not
> > > > > > provide any information on what they expect to use before they use
> > > > > > it. Let's take a Spark application. The driver creates pods as it
> > > > > > needs new executors. The Spark config drives those requests and the
> > > > > > limitations. The scheduler only sees the pods that are really
> > > > > > requested. It does not know, and should not know, whether that is
> > > > > > limited by what is configured, or whether the job uses only part
> > > > > > of, or more than, what is configured.
> > > > > >
> > > > > > The only time the scheduler would have any idea about a "maximum"
> > > > > > is when a gang request is made. For gang scheduling we can track
> > > > > > whether the gang request is completely used or not. We could add
> > > > > > metrics for it on an application. We can also track the number of
> > > > > > containers allocated for an application or queue, the time from
> > > > > > start to finish for containers, etc. We could even track the
> > > > > > maximum resource allocation for an application or a queue over a
> > > > > > time interval. Prometheus should give us a number of possibilities;
> > > > > > we just need to hook them into the scheduling cycle.
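> > > > > >
> > > > > > For example, a minimal sketch only (the metric and label names are
> > > > > > made up, not existing YuniKorn code) of an application-level gauge
> > > > > > with the Prometheus Go client that the allocation path could update:
> > > > > >
> > > > > > package metrics // hypothetical package
> > > > > >
> > > > > > import "github.com/prometheus/client_golang/prometheus"
> > > > > >
> > > > > > // Hypothetical per-application gauge, updated from the scheduling
> > > > > > // cycle whenever allocations for an application change.
> > > > > > var appAllocatedVcores = prometheus.NewGaugeVec(
> > > > > >     prometheus.GaugeOpts{
> > > > > >         Namespace: "yunikorn",
> > > > > >         Name:      "application_allocated_vcores",
> > > > > >         Help:      "vcores currently allocated per application",
> > > > > >     },
> > > > > >     []string{"queue", "application"},
> > > > > > )
> > > > > >
> > > > > > func init() {
> > > > > >     prometheus.MustRegister(appAllocatedVcores)
> > > > > > }
> > > > > >
> > > > > > // called whenever an allocation is added or removed for an app
> > > > > > func recordAllocation(queue, appID string, vcores float64) {
> > > > > >     appAllocatedVcores.WithLabelValues(queue, appID).Set(vcores)
> > > > > > }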
> > > > > >
> > > > > > As far as I know we currently do not have application metrics, but
> > > > > > that can always be added. Some queue metrics are there already. I
> > > > > > think one of those is what you are looking for to fill a number of
> > > > > > the gaps that you see. I have added YUNIKORN-829 as a subtask to
> > > > > > YUNIKORN-720 [1], which already references a number of metrics to
> > > > > > improve. With the release of v0.12.1 I moved that jira to v1.0.0. A
> > > > > > major improvement to the metrics would be a nice addition for
> > > > > > v1.0.0.
> > > > > >
> > > > > > I do not see anything blocking enhancing the metrics: it is a part
> > > > > > that can be improved without a major impact on other functionality.
> > > > > > We do need to make sure that we measure the impact on performance
> > > > > > and memory usage.
> > > > > >
> > > > > > Wilfred
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/YUNIKORN-720
> > > > > >
> > > > > > On Tue, 21 Dec 2021 at 16:18, Bowen Li <bl...@apache.org> wrote:
> > > > > >
> > > > > > > Hi community,
> > > > > > >
> > > > > > > Reviving https://issues.apache.org/jira/browse/YUNIKORN-829 . We
> > > > > > > are running Spark on YuniKorn, and have a requirement to provide
> > > > > > > more observability of *actual* resource usage for our customers:
> > > > > > > data engineers/scientists who write Spark jobs and may not have
> > > > > > > deep expertise in Spark job optimization.
> > > > > > >
> > > > > > > - requirement:
> > > > > > >
> > > > > > > - have actual resource usage metrics at both job level and queue
> > > > > > > level (YK already has requested resource usage metrics)
> > > > > > >
> > > > > > > - key use case:
> > > > > > >
> > > > > > > - as indicators of job optimization for ICs like data
> > > > > > > engineers/scientists, to show users how much resources they
> > > > > > > requested v.s. how much resources their jobs actually used
> > > > > > >
> > > > > > > - as an indicator for managers of their team's resource
> > > > > > > utilization. In our setup, or a typical YK setup, each customer
> > > > > > > team has its own YuniKorn queue in a shared, multi-tenant
> > > > > > > environment. Managers of the team would want high level (queue)
> > > > > > > metrics rather than low level (job) ones.
> > > > > > >
> > > > > > > Currently we haven't found a good product on the market to do
> > > > > > > this, so it would be great if YuniKorn can support it. Would like
> > > > > > > your input here on feasibility (seems feasible according to
> > > > > > > Weiwei's comment in Jira), priority, and timeline/complexity of
> > > > > > > the project.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Bowen
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>