Posted to commits@tvm.apache.org by lm...@apache.org on 2021/03/02 01:23:42 UTC

[tvm-site] branch pr-ansor-blog created (now 8f41697)

This is an automated email from the ASF dual-hosted git repository.

lmzheng pushed a change to branch pr-ansor-blog
in repository https://gitbox.apache.org/repos/asf/tvm-site.git.


      at 8f41697  add ansor blog

This branch includes the following new commits:

     new 8f41697  add ansor blog

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



[tvm-site] 01/01: add ansor blog

Posted by lm...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

lmzheng pushed a commit to branch pr-ansor-blog
in repository https://gitbox.apache.org/repos/asf/tvm-site.git

commit 8f416972a358de045c81fcd78adf8b6bde38f035
Author: Lianmin Zheng <li...@gmail.com>
AuthorDate: Mon Mar 1 17:23:17 2021 -0800

    add ansor blog
---
 _posts/2021-03-01-intro-auto-scheduler.md       | 131 ++++++++++++++++++++++++
 images/intro-auto-scheduler/code_perf.png       | Bin 0 -> 36724 bytes
 images/intro-auto-scheduler/search_overview.png | Bin 0 -> 433415 bytes
 images/intro-auto-scheduler/search_time.png     | Bin 0 -> 45583 bytes
 images/intro-auto-scheduler/workflow.png        | Bin 0 -> 1014076 bytes
 5 files changed, 131 insertions(+)

diff --git a/_posts/2021-03-01-intro-auto-scheduler.md b/_posts/2021-03-01-intro-auto-scheduler.md
new file mode 100644
index 0000000..dfdad51
--- /dev/null
+++ b/_posts/2021-03-01-intro-auto-scheduler.md
@@ -0,0 +1,131 @@
+---
+layout: post
+title: Introducing TVM Auto-scheduler (a.k.a. Ansor)
+date: 2021-03-01
+author: Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu
+---
+
+Optimizing the execution speed of deep neural networks is extremely hard with the growing
+model size, operator diversity, and hardware heterogeneity.
+From a computational perspective, deep neural networks are just layers and layers of tensor computations.
+These tensor computations, such as matmul and conv2d, can be easily described by mathematical expressions.
+However, providing high-performance implementations for them on modern hardware can be very challenging.
+We have to apply various low-level optimizations and utilize special hardware intrinsics to achieve high performance.
+It takes huge engineering effort to build linear algebra and neural network acceleration libraries like cuBLAS, cuDNN, oneMKL, and oneDNN.
+
+Our life will be much easier if we can just write mathematical expressions and have something
+magically turn them into efficient code implementations.
+Three years ago, we built AutoTVM as the first step towards this goal.
+AutoTVM employs a template-based search algorithm to find efficient implementations for a given tensor computation.
+However, this approach still requires domain experts to implement a non-trivial manual template
+for every operator on every platform.
+Today, there are more than 15k lines of code for these templates in the TVM code repository.
+Besides being very hard to develop, these templates often have limited search spaces,
+making them unable to achieve optimal performance.
+
+To address the limitations of AutoTVM, we started project Ansor to build a fully automated auto-scheduler for code generation.
+Ansor auto-scheduler only takes tensor expressions as input and generates high-performance code without manual templates.
+We made innovations in the search space construction and search algorithm.
+As a result, the auto-scheduler can achieve better performance with less search time in a more automated way.
+
+Ansor auto-scheduler is now integrated into Apache TVM as `tvm.auto_scheduler` package.
+Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and Mali GPUs on the TVM website [1].
+In this blog post, we will give a high-level introduction and show some benchmark results.
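+As a quick taste of the API, here is a minimal sketch of tuning a single matmul on CPU,
+adapted from the tutorials above (the workload shape and trial budget are illustrative):
+
+```python
+import tvm
+from tvm import te, auto_scheduler
+
+@auto_scheduler.register_workload
+def matmul(N, L, M, dtype):
+    # The compute definition in tensor expression language.
+    A = te.placeholder((N, L), name="A", dtype=dtype)
+    B = te.placeholder((L, M), name="B", dtype=dtype)
+    k = te.reduce_axis((0, L), name="k")
+    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
+    return [A, B, C]
+
+# Define the search task. No schedule template is needed.
+target = tvm.target.Target("llvm")
+task = auto_scheduler.SearchTask(func=matmul, args=(1024, 1024, 1024, "float32"), target=target)
+
+# Search, then apply the best schedule found.
+log_file = "matmul.json"
+task.tune(auto_scheduler.TuningOptions(
+    num_measure_trials=1000,  # illustrative budget; more trials give better results
+    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
+))
+sch, args = task.apply_best(log_file)
+func = tvm.build(sch, args, target)
+```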
+
+# System Overview
+
+## AutoTVM vs Auto-scheduler
+{:center: style="text-align: center"}
+![image](/images/intro-auto-scheduler/workflow.png){: width="75%"}
+{:center}
+<center> Table 1. Workflow Comparison </center> <p></p>
+
+Table 1 compares the workflow for generating code for an operator in AutoTVM and auto-scheduler.
+In AutoTVM, the developer has to go through three steps.
+In step 1, the developer has to write the compute definition in TVM's tensor expression language.
+This part is relatively easy because TVM's tensor expression language looks just like math expressions.
+In step 2, the developer has to write a schedule template, which typically consists of 20-100 lines of tricky DSL code.
+This part requires domain expertise in both the target hardware architecture and the operator semantics, so it is the most difficult step.
+The last step, step 3, is automated by a search algorithm.
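+For concreteness, the condensed sketch below shows what step 2 looks like, in the style of
+the AutoTVM tutorials (the template name and knobs are hypothetical, and real templates are
+much longer): the developer must hand-pick which loop axes to split and reorder, and expose
+those choices as tunable knobs.
+
+```python
+from tvm import autotvm, te
+
+@autotvm.template("example/matmul")  # hypothetical template name
+def matmul_template(N, L, M, dtype):
+    # Step 1: the compute definition.
+    A = te.placeholder((N, L), name="A", dtype=dtype)
+    B = te.placeholder((L, M), name="B", dtype=dtype)
+    k = te.reduce_axis((0, L), name="k")
+    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
+
+    # Step 2: the manual schedule template.
+    s = te.create_schedule(C.op)
+    y, x = s[C].op.axis
+
+    # The developer defines the tunable knobs (i.e., the search space)...
+    cfg = autotvm.get_config()
+    cfg.define_split("tile_y", y, num_outputs=2)
+    cfg.define_split("tile_x", x, num_outputs=2)
+
+    # ...and manually applies them to the schedule.
+    yo, yi = cfg["tile_y"].apply(s, C, y)
+    xo, xi = cfg["tile_x"].apply(s, C, x)
+    s[C].reorder(yo, xo, yi, xi)
+    return s, [A, B, C]
+```
+
+Any optimization that is not written into the template is invisible to the search,
+which is why hand-written templates often limit the search space.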
+
+In auto-scheduler, we eliminate the most difficult step 2 by automatic search space construction and accelerate step 3 with a better search algorithm.
+By constructing the search space automatically, we not only eliminate huge manual effort
+but also enable the exploration of many more optimization combinations.
+This automation does not come for free, because we still need to design rules to generate the search space.
+However, these rules are very general. They are based on static analysis of the tensor expressions.
+We only need to design a few general rules once and can apply them to almost all tensor computations in deep learning.
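+To give a flavor of these rules, here is a conceptual Python sketch, not the actual Ansor
+implementation (which lives in C++); `has_data_reuse` and `multi_level_tile` are hypothetical
+helpers standing in for the real static analysis and sketch expansion:
+
+```python
+class SketchRule:
+    """A derivation rule: when `condition` holds for a stage of the tensor
+    expression, `apply` expands a partial sketch into new partial sketches."""
+    def condition(self, state, stage_id):
+        raise NotImplementedError
+    def apply(self, state, stage_id):
+        raise NotImplementedError
+
+class MultiLevelTiling(SketchRule):
+    # Fires on compute-intensive stages with data reuse (e.g. matmul, conv2d),
+    # detected by static analysis of the tensor expression.
+    def condition(self, state, stage_id):
+        return has_data_reuse(state, stage_id)  # hypothetical helper
+
+    def apply(self, state, stage_id):
+        # "SSRSRS" is the multi-level tile structure the Ansor paper uses for
+        # CPUs (S = space loop tile level, R = reduction loop tile level).
+        return [multi_level_tile(state, stage_id, structure="SSRSRS")]
+```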
+
+## Search Process
+{:center: style="text-align: center"}
+![image](/images/intro-auto-scheduler/search_overview.png){: width="40%"}
+{:center}
+<center> Figure 1. Search Process Overview  </center> <p></p>
+
+Figure 1 shows the search process of auto-scheduler when optimizing a whole neural network.
+The system takes deep learning models as input.
+It then partitions the big model into small subgraphs with Relay's operator fusion pass.
+A task scheduler is used to allocate the time budget across the many subgraphs.
+At each iteration, it picks the subgraph with the most potential to increase the end-to-end performance.
+For this subgraph, we analyze its tensor expression and generate several sketches for it.
+Then we run evolutionary search with a learned cost model to get a batch of optimized programs.
+The optimized programs are sent to actual hardware for measurements.
+When the measurements are finished, the profiling results are used as feedback to update all components of the system.
+This process is repeated iteratively until the optimization converges or we run out of time budget.
+More technical details can be found in our paper [3] and our code.
+
+The auto-scheduler reuses the existing computation definitions in TOPI but does not use any schedule template.
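+In code, optimizing a whole network looks roughly like the sketch below, based on the
+network tuning tutorials; `mod` and `params` are assumed to be a Relay module and its
+weights obtained from one of the frontend importers:
+
+```python
+import tvm
+from tvm import auto_scheduler, relay
+
+# Partition the network into tuning tasks, one per fused subgraph.
+tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
+
+# The task scheduler allocates the measurement budget across all tasks,
+# prioritizing subgraphs with the most end-to-end potential.
+log_file = "network.json"
+tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
+tuner.tune(auto_scheduler.TuningOptions(
+    num_measure_trials=20000,  # illustrative total budget shared by all tasks
+    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
+))
+
+# Compile the network with the best schedules found during the search.
+with auto_scheduler.ApplyHistoryBest(log_file):
+    with tvm.transform.PassContext(
+        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
+    ):
+        lib = relay.build(mod, target=target, params=params)
+```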
+
+
+# Benchmark Results
+In this section, we benchmark the performance of AutoTVM and auto-scheduler.
+The CPU benchmark is done on an AWS c5.9xlarge instance, which is equipped with an 18-core Intel Skylake (Xeon Platinum 8124M) CPU.
+The GPU benchmark is done on an AWS g4dn.4xlarge instance, which is equipped with an NVIDIA T4 GPU.
+All benchmark code, raw data, and tuning logs can be found in this repo [2].
+
+## Performance of the generated code
+We benchmark the fp32 single-batch inference latency on three networks.
+Figure 2 shows the relative speedup auto-scheduler can get against AutoTVM.
+We can see that auto-scheduler outperforms AutoTVM in all cases, with speedups ranging from 1.02x to 8.95x.
+This is because auto-scheduler explores a larger search space and can find more efficient combinations
+of optimizations that are missed by manual templates.
+BERT-base@GPU is an extreme case where the manual templates fall short:
+the manual template for dense layers does not perform well for the shapes in the BERT model.
+
+{:center: style="text-align: center"}
+![image](/images/intro-auto-scheduler/code_perf.png){: width="85%"}
+{:center}
+<center> Figure 2. Code Performance Comparison (Higher is better) </center> <p></p>
+
+## Search Time
+Search-based approaches can be very time-consuming, so we also care about the search time.
+It typically takes several hours for the search to converge on a single neural network.
+Figure 3 compares the search time of AutoTVM and auto-scheduler.
+Auto-scheduler requires much less time to converge in most cases, despite its larger search space.
+This is because auto-scheduler has a better cost model and task scheduler.
+
+{:center: style="text-align: center"}
+![image](/images/intro-auto-scheduler/search_time.png){: width="85%"}
+{:center}
+<center> Figure 3. Search Time Comparison (Lower is better) </center> <p></p>
+
+## More Results
+The repo above serves as an internal benchmark tool for TVM, so it only compares the latest AutoTVM and auto-scheduler.
+You can find results for more libraries and backends in our paper [3].
+Recently, this blog post [4] also tried auto-scheduler on an Apple M1 chip and got some good results.
+
+# Conclusion
+We built TVM auto-scheduler, a system that automatically generates high-performance code for tensor expressions.
+Compared with the predecessor AutoTVM, auto-scheduler does not require manual templates.
+In addition, auto-scheduler generates better code with less search time.
+We achieve this by making innovations in the search space construction and search algorithm.
+
+We are excited about the current performance of auto-scheduler.
+In the future, we are interested in extending auto-scheduler to better support
+sparse operators, low-precision operators, and dynamic shapes.
+
+# Links
+[1] Tutorials: [https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling](https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling)  
+[2] Benchmark repo: [https://github.com/tlc-pack/TLCBench](https://github.com/tlc-pack/TLCBench)  
+[3] OSDI Paper: [Ansor : Generating High-Performance Tensor Programs for Deep Learning](https://arxiv.org/abs/2006.06762)  
+[4] Results on Apple M1 chip: [https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d](https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d)
+
diff --git a/images/intro-auto-scheduler/code_perf.png b/images/intro-auto-scheduler/code_perf.png
new file mode 100644
index 0000000..d070a6e
Binary files /dev/null and b/images/intro-auto-scheduler/code_perf.png differ
diff --git a/images/intro-auto-scheduler/search_overview.png b/images/intro-auto-scheduler/search_overview.png
new file mode 100644
index 0000000..7b6f56d
Binary files /dev/null and b/images/intro-auto-scheduler/search_overview.png differ
diff --git a/images/intro-auto-scheduler/search_time.png b/images/intro-auto-scheduler/search_time.png
new file mode 100644
index 0000000..4bd700b
Binary files /dev/null and b/images/intro-auto-scheduler/search_time.png differ
diff --git a/images/intro-auto-scheduler/workflow.png b/images/intro-auto-scheduler/workflow.png
new file mode 100644
index 0000000..b2c7b26
Binary files /dev/null and b/images/intro-auto-scheduler/workflow.png differ