Posted to reviews@yunikorn.apache.org by ww...@apache.org on 2021/12/10 17:58:14 UTC

[incubator-yunikorn-site] branch master updated: [YUNIKORN-845] Publish benchmark result on web-site (#101)

This is an automated email from the ASF dual-hosted git repository.

wwei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-yunikorn-site.git


The following commit(s) were added to refs/heads/master by this push:
     new 12d9eec  [YUNIKORN-845] Publish benchmark result on web-site (#101)
12d9eec is described below

commit 12d9eeca8df28246362b597994c1db48134a2d0d
Author: YuTeng Chen <45...@users.noreply.github.com>
AuthorDate: Sat Dec 11 01:55:50 2021 +0800

    [YUNIKORN-845] Publish benchmark result on web-site (#101)
    
    Co-authored-by: Weiwei Yang <ww...@apache.org>
---
 .../version-0.12.0/assets/allocation_4k.png        | Bin 0 -> 68868 bytes
 .../assets/predicateComaparation.png               | Bin 0 -> 182417 bytes
 .../version-0.12.0/assets/predicate_4k.png         | Bin 0 -> 86281 bytes
 .../assets/scheduling_no_predicate_4k.png          | Bin 0 -> 96404 bytes
 .../assets/scheduling_with_predicate_4k_.png       | Bin 0 -> 107557 bytes
 .../version-0.12.0/assets/throughput_3types.png    | Bin 0 -> 173025 bytes
 .../version-0.12.0/assets/yunirkonVSdefault.png    | Bin 0 -> 156100 bytes
 .../evaluate_perf_function_with_kubemark.md        | 121 ++++++++++++---------
 8 files changed, 69 insertions(+), 52 deletions(-)

diff --git a/versioned_docs/version-0.12.0/assets/allocation_4k.png b/versioned_docs/version-0.12.0/assets/allocation_4k.png
new file mode 100644
index 0000000..03346f5
Binary files /dev/null and b/versioned_docs/version-0.12.0/assets/allocation_4k.png differ
diff --git a/versioned_docs/version-0.12.0/assets/predicateComaparation.png b/versioned_docs/version-0.12.0/assets/predicateComaparation.png
new file mode 100755
index 0000000..d3498c8
Binary files /dev/null and b/versioned_docs/version-0.12.0/assets/predicateComaparation.png differ
diff --git a/versioned_docs/version-0.12.0/assets/predicate_4k.png b/versioned_docs/version-0.12.0/assets/predicate_4k.png
new file mode 100644
index 0000000..850036c
Binary files /dev/null and b/versioned_docs/version-0.12.0/assets/predicate_4k.png differ
diff --git a/versioned_docs/version-0.12.0/assets/scheduling_no_predicate_4k.png b/versioned_docs/version-0.12.0/assets/scheduling_no_predicate_4k.png
new file mode 100644
index 0000000..0ebe41c
Binary files /dev/null and b/versioned_docs/version-0.12.0/assets/scheduling_no_predicate_4k.png differ
diff --git a/versioned_docs/version-0.12.0/assets/scheduling_with_predicate_4k_.png b/versioned_docs/version-0.12.0/assets/scheduling_with_predicate_4k_.png
new file mode 100644
index 0000000..2cee7c0
Binary files /dev/null and b/versioned_docs/version-0.12.0/assets/scheduling_with_predicate_4k_.png differ
diff --git a/versioned_docs/version-0.12.0/assets/throughput_3types.png b/versioned_docs/version-0.12.0/assets/throughput_3types.png
new file mode 100644
index 0000000..a4a583b
Binary files /dev/null and b/versioned_docs/version-0.12.0/assets/throughput_3types.png differ
diff --git a/versioned_docs/version-0.12.0/assets/yunirkonVSdefault.png b/versioned_docs/version-0.12.0/assets/yunirkonVSdefault.png
new file mode 100755
index 0000000..a123ff8
Binary files /dev/null and b/versioned_docs/version-0.12.0/assets/yunirkonVSdefault.png differ
diff --git a/versioned_docs/version-0.12.0/performance/evaluate_perf_function_with_kubemark.md b/versioned_docs/version-0.12.0/performance/evaluate_perf_function_with_kubemark.md
index f2c7090..df244c2 100644
--- a/versioned_docs/version-0.12.0/performance/evaluate_perf_function_with_kubemark.md
+++ b/versioned_docs/version-0.12.0/performance/evaluate_perf_function_with_kubemark.md
@@ -1,6 +1,6 @@
 ---
 id: evaluate_perf_function_with_kubemark
-title: Evaluate YuniKorn function & performance with Kubemark
+title: Evaluate YuniKorn Performance with Kubemark
 keywords:
  - performance
  - throughput
@@ -25,79 +25,96 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-All the following tests are done with [Kubemark](https://github.com/kubernetes/kubernetes/blob/release-1.3/docs/devel/kubemark-guide.md#starting-a-kubemark-cluster),
-a tool helps us to simulate large K8s cluster and run experimental workloads.
-There were 18 bare-metal servers being used to simulate 2000/4000 nodes for these tests. 
+The YuniKorn community cares about the scheduler's performance and continues to optimize it over the releases. The community has developed some tools in order to test and tune the performance repeatedly.
+
+## Environment setup 
+
+We leverage [Kubemark](https://github.com/kubernetes/kubernetes/blob/release-1.3/docs/devel/kubemark-guide.md#starting-a-kubemark-cluster) to evaluate the scheduler's performance. Kubemark is a testing tool that simulates large-scale clusters. It creates hollow nodes, each running a hollow kubelet that mimics the behavior of a real kubelet; pods scheduled on these hollow nodes are not actually executed. This lets us create a cluster big enough for our experiments and reveal the yunikorn sched [...]
 
 ## Scheduler Throughput
 
-When running Big Data batch workloads, e.g Spark, on K8s, scheduler throughput becomes to be one of the main concerns.
-In YuniKorn, we have done lots of optimizations to improve the performance, such as a fully async event-driven system
-and low-latency sorting policies. The following chart reveals the scheduler throughput (by using Kubemark simulated
-environment, and submitting 50,000 pods), comparing to the K8s default scheduler.
+We have designed some simple benchmarking scenarios on a simulated large-scale environment in order to evaluate the scheduler's performance. Our tools measure the [throughput](https://en.wikipedia.org/wiki/Throughput) and use it as the key metric to evaluate the performance. In a nutshell, scheduler throughput is the rate at which pods are processed, from being discovered on the cluster to being allocated to nodes.
+
+In this experiment, we set up a simulated 2000/4000-node cluster with [Kubemark](https://github.com/kubernetes/kubernetes/blob/release-1.3/docs/devel/kubemark-guide.md#starting-a-kubemark-cluster). Then we launch 10 [deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/), each with 5000 replicas. This simulates large-scale workloads being submitted to the K8s cluster simultaneously. Our tool periodically monitors and checks po [...]
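+
+As a rough illustration of what such a monitoring tool does (a minimal sketch, not the actual tool used in these tests; the `default` namespace, the polling interval and the pod total are assumptions), the script below polls the API server with the official Kubernetes Python client and reports how many pods have reached the `Running` phase, from which a pods/sec throughput can be derived:
+
+```python
+import time
+from kubernetes import client, config
+
+config.load_kube_config()          # or config.load_incluster_config() inside the cluster
+v1 = client.CoreV1Api()
+
+TARGET = 50000                     # 10 deployments x 5000 replicas in this experiment
+start = time.time()
+
+while True:
+    pods = v1.list_namespaced_pod(namespace="default").items
+    running = sum(1 for p in pods if p.status.phase == "Running")
+    elapsed = time.time() - start
+    print(f"{elapsed:6.1f}s  running={running}  throughput={running / elapsed:.1f} pods/sec")
+    if running >= TARGET:
+        break
+    time.sleep(5)
+```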
+
+![Scheduler Throughput](./../assets/yunirkonVSdefault.png)
+<p align="center">Fig 1. Yunikorn and default scheduler throughput </p>
+
+The charts record the time spent until all pods are running on the cluster:
+
+| Number of Nodes | YuniKorn (pods/sec) | K8s default scheduler (pods/sec) | Diff |
+|-----------------|:-------------------:|:--------------------------------:|:----:|
+| 2000            | 204                 | 49                               | 416% |
+| 4000            | 115                 | 48                               | 240% |
+
+In order to normalize the results, we ran the tests for several rounds. As shown above, YuniKorn achieves a `2x` ~ `4x` performance gain compared to the default scheduler.
+
+:::note
+
+As with any performance test, the results are highly dependent on the underlying hardware, e.g. server CPU/memory, network bandwidth, I/O speed, etc. To get accurate results that apply to your environment, we encourage you to run these tests on a cluster that is close to your production environment.
+
+:::
+
+## Performance Analysis
+
+The results we got from the experiments are promising. We took a deeper dive to analyze the performance by observing more internal YuniKorn metrics, and we were able to locate a few key areas affecting performance.
+
+### K8s Limitation
+
+We found that the overall performance is actually capped by the K8s master services, such as the api-server, controller-manager and etcd; in none of our experiments did we reach the limit of YuniKorn itself. If you look at the internal scheduling metrics, you can see:
+
+![Allocation latency](./../assets/allocation_4k.png)
+<p align="center">Fig 2. Yunikorn metric in 4k nodes </p>
+
+Figure 2 is a screenshot from Prometheus, which records the [internal metric](performance/metrics.md) `containerAllocation` in YuniKorn. This is the number of pods allocated by the scheduler, which have not necessarily been bound to nodes yet. It takes roughly 122 seconds to finish scheduling 50k pods, i.e. 410 pods/sec. The actual throughput drops to 115 pods/sec, and the extra time was spent binding the pods to the different nodes. If the K8s side could catch up, we would see a better result. [...]
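+
+For readers who want to reproduce this kind of chart, querying the Prometheus HTTP API is enough. The sketch below is a minimal example; the metric name `yunikorn_container_allocation_total` and the Prometheus address are placeholders, so check the [metrics](performance/metrics.md) page for the exact names exposed by your deployment:
+
+```python
+import requests
+
+PROM_URL = "http://localhost:9090"   # placeholder Prometheus address
+# per-second allocation rate over 1-minute windows (placeholder metric name)
+query = "rate(yunikorn_container_allocation_total[1m])"
+
+resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
+for series in resp.json()["data"]["result"]:
+    print(series["metric"], series["value"])
+```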
 
-![Scheduler Throughput](./../assets/throughput.png)
+### Node Sorting
 
-The charts record the time spent until all pods are running on the cluster
+When the cluster size grows, we see an obvious performance drop in YuniKorn. This is because YuniKorn does a full sort of the cluster nodes in order to find the **"best-fit"** node for a given pod. Such a strategy makes the pod distribution more optimal based on the [node sorting policy](./../user_guide/sorting_policies#node-sorting) being used. However, sorting nodes is expensive, and doing it in every scheduling cycle creates a lot of overhead. To overcome this, we have improved our [...]
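+
+To see why this gets expensive, consider the simplified sketch below (illustrative only, not YuniKorn's actual Go implementation): a best-fit selection that re-sorts every node on each scheduling attempt pays an O(N log N) cost per pod, which grows noticeably at 4000 nodes:
+
+```python
+def pick_best_fit(nodes, pod):
+    """Re-sort ALL nodes by the sorting policy on every attempt -- O(N log N) per pod."""
+    candidates = sorted(nodes, key=lambda n: n["available"])   # full sort each cycle
+    for node in candidates:
+        if node["available"] >= pod["request"]:
+            return node
+    return None
+
+nodes = [{"name": f"node-{i}", "available": i % 64} for i in range(4000)]
+print(pick_best_fit(nodes, {"request": 8})["name"])
+```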
 
-|                       	| THROUGHPUT (pods/sec) 	| THROUGHPUT (pods/sec) 	|
-|-----------------------	|:---------------------:	|:---------------------:	|
-| ENVIRONMENT (# nodes) 	|   Default Scheduler   	|        YuniKorn       	|
-| 2000                  	| 263                   	| 617                   	|
-| 4000                  	| 141                   	| 373                   	|
+### Per Node Precondition Checks
 
-## Resource Fairness between queues
+In each scheduling cycle, another time-consuming part is the "Precondition Checks" for a node. In this phase, YuniKorn evaluates all the K8s standard predicates, e.g. node selector, pod affinity/anti-affinity, etc., in order to determine whether a pod fits onto a node. These evaluations are expensive.
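+
+Even the simplest of these predicates, the node-selector check, has to compare labels for every candidate node; the heavier affinity/anti-affinity rules cost far more. A hedged sketch of the idea (illustrative only, not the actual predicate code):
+
+```python
+def node_selector_matches(pod_node_selector, node_labels):
+    """A pod fits only if every selector key/value is present on the node's labels."""
+    return all(node_labels.get(k) == v for k, v in pod_node_selector.items())
+
+# evaluating this (and the much heavier affinity/anti-affinity rules)
+# for every pod x node pair is what makes this phase expensive at 4k nodes
+print(node_selector_matches({"disktype": "ssd"}, {"disktype": "ssd", "zone": "a"}))  # True
+```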
 
-Each of YuniKorn queues has its guaranteed and maximum capacity. When we have lots of jobs submitted to these queues,
-YuniKorn ensures each of them gets its fair share. When we monitor the resource usage of these queues, we can clearly
-see how fairness was enforced:
+We ran two experiments to compare the cases where predicate evaluation was enabled versus disabled. See the results below:
 
-![Queue Fairness](./../assets/queue-fairness.png)
+![Predicate comparison](./../assets/predicateComaparation.png)
+<p align="center">Fig 3. Predicate effect comparison in YuniKorn</p>
 
-We set up 4 heterogeneous queues on this cluster, and submit different workloads against these queues.
-From the chart, we can see the queue resources are increasing nearly in the same trend, which means the resource
-fairness across queues is honored.
+When predicate evaluation is disabled, the throughput improves a lot. We looked further into the latency distribution of the entire scheduling cycle and the predicate-evaluation latency, and found:
 
-## Node sorting policies
+![YK predicate latency](./../assets/predicate_4k.png)
+<p align="center">Fig 4. predicate latency </p>
 
-There are 2 node sorting policies available in YuniKorn, with regarding the pod distributing flavors. One is *FAIR*,
-which tries best to evenly distribute pods to nodes; the other one is *BIN-PACKING*, which tries best to bin pack pods
-to less number of nodes. The former one is suitable for the Data Center scenarios, it helps to balance the stress of
-cluster nodes; the latter one is suitable to be used on Cloud, it can minimize the number of instances when working
-with auto-scaler, in order to save cost.
+![YK scheduling with predicate](./../assets/scheduling_with_predicate_4k_.png)
+<p align="center">Fig 5. Scheduling time with predicate active </p>
 
-### FAIR Policy
+![YK scheduling with no predicate](./../assets/scheduling_no_predicate_4k.png)
+<p align="center">Fig 6. Scheduling time with predicate inactive </p>
 
-We group nodes into 10 buckets, each bucket represents for the number of nodes that has a similar resource
-utilization (a range).  To help you understand the chart, imagine the buckets have the following values at a certain
-point of time:
+Overall, the YuniKorn scheduling cycle runs very fast, with the latency falling in the **0.001s - 0.01s** range per cycle. The majority of that time is spent on predicate evaluation, roughly 10x more than the other parts of the scheduling cycle.
 
-|   BUCKET 	| RESOURCE UTILIZATION RANGE 	| VALUE 	|
-|:--------:	|:--------------------------:	|:-----:	|
-| bucket-0 	| 0% - 10%                   	| 100   	|
-| bucket-1 	| 10% - 20%                  	| 300   	|
-| ...      	|                            	|       	|
-| bucket-9 	| 90% - 100%                 	| 0     	|
+|                     | Scheduling latency distribution (seconds) | Predicates-eval latency distribution (seconds) |
+|---------------------|:------------------------------------------:|:----------------------------------------------:|
+| predicates enabled  | 0.01 - 0.1                                  | 0.01 - 0.1                                       |
+| predicates disabled | 0.001 - 0.01                                | none                                             |
 
-This means at the given time, this cluster has 100 nodes whose utilization is in the range 0% to 10%;
-it has 300 nodes whose utilization is in the range 10% - 20%, and so on… Now, we run lots of workloads and
-collect metrics, see the below chart:
+## Why is YuniKorn faster?
 
-![Node Fairness](./../assets/node-fair.png)
+The default scheduler was created as a service-oriented scheduler; it is less optimized for throughput than YuniKorn. The YuniKorn community works hard to keep the performance outstanding and to keep improving it. The reasons YuniKorn can run faster than the default scheduler are:
 
-We can see all nodes have 0% utilization, and then all of them move to bucket-1, then bucket-2 … and eventually
-all nodes moved to bucket-9, which means all capacity is used. In another word, nodes’ resource has been used in
-a fairness manner.
+* Short Circuit Scheduling Cycle
 
-### BIN-PACKING
+YuniKorn keeps the scheduling cycle short and efficient. It uses fully asynchronous communication to ensure that all the critical paths are non-blocking calls; most of the work is pure in-memory calculation, which is highly efficient. The default scheduler leverages the [scheduling framework](https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/), which provides lots of flexibility to extend the scheduler; however, the trade-off is performance. The sc [...]
 
-This is When the bin-packing policy is enabled, we can see the following pattern:
+* Async Event Handling
 
-![Node Bin-Packing](./../assets/node-bin-packing.png)
+YuniKorn leverages an async event handling framework to deal with internal states, which allows the core scheduling cycle to run fast without being blocked by any expensive calls. For example, the default scheduler needs to write state updates and events to pod objects inside the scheduling cycle; this involves persisting data to etcd, which can be slow. YuniKorn instead caches all such events in a queue and writes them back to the pods in an asynchronous manner.
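+
+A minimal sketch of this pattern (not YuniKorn's actual implementation, which is written in Go; all names here are illustrative): the scheduling loop only enqueues the update, and a background worker flushes it to the API server later:
+
+```python
+import queue
+import threading
+
+update_queue = queue.Queue()
+
+def bind_pod_to_node(pod, node):
+    """Placeholder for the slow, blocking call to the API server / etcd."""
+    pass
+
+def scheduling_cycle(pod, node):
+    # the allocation decision is made purely in memory;
+    # instead of calling the API server synchronously, just enqueue the event
+    update_queue.put((pod, node))
+
+def async_writer():
+    while True:
+        pod, node = update_queue.get()
+        bind_pod_to_node(pod, node)   # expensive write happens off the critical path
+        update_queue.task_done()
+
+threading.Thread(target=async_writer, daemon=True).start()
+```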
 
-On the contrary, all nodes are moving between 2 buckets, bucket-0 and bucket-9. Nodes in bucket-0 (0% - 10%)
-are decreasing in a linear manner, and nodes in bucket-9 (90% - 100%) are increasing with the same curve.
-In other words, node resources are being used up one by one.
+* Faster Node Sorting
 
+After [YUNIKORN-807](https://issues.apache.org/jira/browse/YUNIKORN-807), YuniKorn does incremental node sorting, which is highly efficient. This is built on top of a "resource-weight" based node scoring mechanism, and it is also extensible via plugins. All of this together reduces the overhead of computing node scores. In comparison, the default scheduler provides a few extension points for calculating node scores, such as `PreScore`, `Score` and `NormalizeScore`. These c [...]
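+
+To make the idea of resource-weight based scoring concrete, here is a simplified illustration (not the actual YuniKorn code, which lives in the Go core; the weights and resource names are arbitrary): a node's score is a weighted sum of its per-resource utilization, and only the node whose allocation just changed needs to be re-scored, which is what makes incremental sorting cheap.
+
+```python
+WEIGHTS = {"vcore": 1.0, "memory": 1.0}   # arbitrary example weights
+
+def node_score(allocated, capacity):
+    """Weighted utilization in [0, 1]; a fair policy prefers lower scores."""
+    total_weight = sum(WEIGHTS.values())
+    score = sum(
+        WEIGHTS[r] * (allocated.get(r, 0) / capacity[r])
+        for r in WEIGHTS
+    )
+    return score / total_weight
+
+# only the node that just received (or released) an allocation is re-scored;
+# the rest of the sorted node list stays valid.
+print(node_score({"vcore": 8, "memory": 16}, {"vcore": 32, "memory": 64}))  # 0.25
+```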
 
+## Summary
 
+During the tests, we found that YuniKorn performs really well, especially compared to the default scheduler. We have identified the major factors in YuniKorn where we can continue to improve the performance, and we have also explained why YuniKorn performs better than the default scheduler. We also realized the limitations of scaling Kubernetes to thousands of nodes, which can be alleviated by other techniques such as federation. As a result, YuniKorn is a highly efficient, high-t [...]