Posted to commits@tvm.apache.org by lm...@apache.org on 2021/03/03 09:21:03 UTC

[tvm-site] branch asf-site updated: Build at Wed Mar 3 01:20:50 PST 2021

This is an automated email from the ASF dual-hosted git repository.

lmzheng pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/tvm-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 8828684  Build at Wed Mar  3 01:20:50 PST 2021
8828684 is described below

commit 88286848906663319587f02c11f37dd2fe696f30
Author: Lianmin Zheng <li...@gmail.com>
AuthorDate: Wed Mar 3 01:20:50 2021 -0800

    Build at Wed Mar  3 01:20:50 PST 2021
---
 2017/08/17/tvm-release-announcement.html           |   2 +-
 ...s-with-TVM-A-Depthwise-Convolution-Example.html |   2 +-
 2017/10/06/nnvm-compiler-announcement.html         |   2 +-
 ...s-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html |   2 +-
 2017/11/08/android-rpc-introduction.html           |   2 +-
 2018/01/16/opt-mali-gpu.html                       |   2 +-
 2018/03/12/webgl.html                              |   2 +-
 2018/03/23/nmt-transformer-optimize.html           |   2 +-
 2018/07/12/vta-release-announcement.html           |   2 +-
 2018/08/10/DLPack-Bridge.html                      |   2 +-
 2018/10/03/auto-opt-all.html                       |   2 +-
 2018/10/09/ml-in-tees.html                         |   2 +-
 2018/12/18/lowprecision-conv.html                  |   2 +-
 2019/01/19/Golang.html                             |   2 +-
 2019/03/18/tvm-apache-announcement.html            |   2 +-
 2019/04/29/opt-cuda-quantized.html                 |   2 +-
 2019/05/30/pytorch-frontend.html                   |   2 +-
 ...machine-learning-to-webassembly-and-webgpu.html |   2 +-
 2020/06/04/tinyml-how-tvm-is-taming-tiny.html      |   2 +-
 2020/07/14/bert-pytorch-tvm.html                   |   2 +-
 .../15/how-to-bring-your-own-codegen-to-tvm.html   |   2 +-
 2020/09/26/bring-your-own-datatypes.html           |   2 +-
 2021/03/03/intro-auto-scheduler.html               | 321 +++++++++++++++++++++
 atom.xml                                           | 253 +++++++++-------
 blog.html                                          |  10 +
 community.html                                     |   4 +
 feed.xml                                           | 291 +++++++++----------
 images/community/sjtu.png                          | Bin 0 -> 236508 bytes
 images/intro-auto-scheduler/code_perf.png          | Bin 0 -> 36724 bytes
 images/intro-auto-scheduler/search_overview.png    | Bin 0 -> 433415 bytes
 images/intro-auto-scheduler/search_time.png        | Bin 0 -> 45583 bytes
 images/intro-auto-scheduler/workflow.png           | Bin 0 -> 1014076 bytes
 rss.xml                                            | 255 +++++++++-------
 sitemap.txt                                        |   1 +
 34 files changed, 789 insertions(+), 390 deletions(-)

diff --git a/2017/08/17/tvm-release-announcement.html b/2017/08/17/tvm-release-announcement.html
index ea95cf0..dbd65e1 100644
--- a/2017/08/17/tvm-release-announcement.html
+++ b/2017/08/17/tvm-release-announcement.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>TVM: An End to End IR Stack for Deploying Deep Learning Workloads on Hardware Platforms </h1>
       <p class="post-meta">
-        <time datetime="2017-08-17T15:00:00-04:00" itemprop="datePublished">
+        <time datetime="2017-08-17T12:00:00-07:00" itemprop="datePublished">
           Aug 17, 2017
         </time>
         
diff --git a/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html b/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html
index 96b2e16..13a15a3 100644
--- a/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html
+++ b/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Optimize Deep Learning GPU Operators with TVM: A Depthwise Convolution Example </h1>
       <p class="post-meta">
-        <time datetime="2017-08-22T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2017-08-22T00:00:00-07:00" itemprop="datePublished">
           Aug 22, 2017
         </time>
         
diff --git a/2017/10/06/nnvm-compiler-announcement.html b/2017/10/06/nnvm-compiler-announcement.html
index 40557e0..b627ca6 100644
--- a/2017/10/06/nnvm-compiler-announcement.html
+++ b/2017/10/06/nnvm-compiler-announcement.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>NNVM Compiler: Open Compiler for AI Frameworks </h1>
       <p class="post-meta">
-        <time datetime="2017-10-06T11:30:00-04:00" itemprop="datePublished">
+        <time datetime="2017-10-06T08:30:00-07:00" itemprop="datePublished">
           Oct 6, 2017
         </time>
         
diff --git a/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html b/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html
index 06f20bd..e6a6c2f 100644
--- a/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html
+++ b/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Bringing AMDGPUs to TVM Stack and NNVM Compiler with ROCm </h1>
       <p class="post-meta">
-        <time datetime="2017-10-30T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2017-10-30T00:00:00-07:00" itemprop="datePublished">
           Oct 30, 2017
         </time>
         
diff --git a/2017/11/08/android-rpc-introduction.html b/2017/11/08/android-rpc-introduction.html
index 7d15d82..f7e34b5 100644
--- a/2017/11/08/android-rpc-introduction.html
+++ b/2017/11/08/android-rpc-introduction.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Remote Profile and Test Deep Learning Cross Compilation on Mobile Phones with TVM RPC </h1>
       <p class="post-meta">
-        <time datetime="2017-11-08T00:00:00-05:00" itemprop="datePublished">
+        <time datetime="2017-11-08T00:00:00-08:00" itemprop="datePublished">
           Nov 8, 2017
         </time>
         
diff --git a/2018/01/16/opt-mali-gpu.html b/2018/01/16/opt-mali-gpu.html
index a039779..40fc7f0 100644
--- a/2018/01/16/opt-mali-gpu.html
+++ b/2018/01/16/opt-mali-gpu.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Optimizing Mobile Deep Learning on ARM GPU with TVM </h1>
       <p class="post-meta">
-        <time datetime="2018-01-16T00:00:00-05:00" itemprop="datePublished">
+        <time datetime="2018-01-16T00:00:00-08:00" itemprop="datePublished">
           Jan 16, 2018
         </time>
         
diff --git a/2018/03/12/webgl.html b/2018/03/12/webgl.html
index 792c922..74313b5 100644
--- a/2018/03/12/webgl.html
+++ b/2018/03/12/webgl.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Compiling Deep Learning Models to WebGL with TVM </h1>
       <p class="post-meta">
-        <time datetime="2018-03-12T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2018-03-12T00:00:00-07:00" itemprop="datePublished">
           Mar 12, 2018
         </time>
         
diff --git a/2018/03/23/nmt-transformer-optimize.html b/2018/03/23/nmt-transformer-optimize.html
index 2182327..35c211a 100644
--- a/2018/03/23/nmt-transformer-optimize.html
+++ b/2018/03/23/nmt-transformer-optimize.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Bringing TVM into TensorFlow for Optimizing Neural Machine Translation on GPU </h1>
       <p class="post-meta">
-        <time datetime="2018-03-23T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2018-03-23T00:00:00-07:00" itemprop="datePublished">
           Mar 23, 2018
         </time>
         
diff --git a/2018/07/12/vta-release-announcement.html b/2018/07/12/vta-release-announcement.html
index c60a3e1..1250749 100644
--- a/2018/07/12/vta-release-announcement.html
+++ b/2018/07/12/vta-release-announcement.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>VTA: An Open, Customizable Deep Learning Acceleration Stack  </h1>
       <p class="post-meta">
-        <time datetime="2018-07-12T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2018-07-12T00:00:00-07:00" itemprop="datePublished">
           Jul 12, 2018
         </time>
         
diff --git a/2018/08/10/DLPack-Bridge.html b/2018/08/10/DLPack-Bridge.html
index 7ec1aaa..af4d193 100644
--- a/2018/08/10/DLPack-Bridge.html
+++ b/2018/08/10/DLPack-Bridge.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Building a Cross-Framework Deep Learning Compiler via DLPack </h1>
       <p class="post-meta">
-        <time datetime="2018-08-10T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2018-08-10T00:00:00-07:00" itemprop="datePublished">
           Aug 10, 2018
         </time>
         
diff --git a/2018/10/03/auto-opt-all.html b/2018/10/03/auto-opt-all.html
index 98269c7..ac36190 100644
--- a/2018/10/03/auto-opt-all.html
+++ b/2018/10/03/auto-opt-all.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Automatic Kernel Optimization for Deep Learning on All Hardware Platforms </h1>
       <p class="post-meta">
-        <time datetime="2018-10-03T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2018-10-03T00:00:00-07:00" itemprop="datePublished">
           Oct 3, 2018
         </time>
         
diff --git a/2018/10/09/ml-in-tees.html b/2018/10/09/ml-in-tees.html
index 992e1a3..0f59a69 100644
--- a/2018/10/09/ml-in-tees.html
+++ b/2018/10/09/ml-in-tees.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Efficient Privacy-Preserving ML Using TVM </h1>
       <p class="post-meta">
-        <time datetime="2018-10-09T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2018-10-09T00:00:00-07:00" itemprop="datePublished">
           Oct 9, 2018
         </time>
         
diff --git a/2018/12/18/lowprecision-conv.html b/2018/12/18/lowprecision-conv.html
index c5def47..f32251d 100644
--- a/2018/12/18/lowprecision-conv.html
+++ b/2018/12/18/lowprecision-conv.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Automating Generation of Low Precision Deep Learning Operators </h1>
       <p class="post-meta">
-        <time datetime="2018-12-18T00:00:00-05:00" itemprop="datePublished">
+        <time datetime="2018-12-18T00:00:00-08:00" itemprop="datePublished">
           Dec 18, 2018
         </time>
         
diff --git a/2019/01/19/Golang.html b/2019/01/19/Golang.html
index 27a39f0..6b8b94a 100644
--- a/2019/01/19/Golang.html
+++ b/2019/01/19/Golang.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>TVM Golang Runtime for Deep Learning Deployment </h1>
       <p class="post-meta">
-        <time datetime="2019-01-19T00:00:00-05:00" itemprop="datePublished">
+        <time datetime="2019-01-19T00:00:00-08:00" itemprop="datePublished">
           Jan 19, 2019
         </time>
         
diff --git a/2019/03/18/tvm-apache-announcement.html b/2019/03/18/tvm-apache-announcement.html
index 386de84..19b5017 100644
--- a/2019/03/18/tvm-apache-announcement.html
+++ b/2019/03/18/tvm-apache-announcement.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>TVM Deep Learning Compiler Joins Apache Software Foundation </h1>
       <p class="post-meta">
-        <time datetime="2019-03-18T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2019-03-18T00:00:00-07:00" itemprop="datePublished">
           Mar 18, 2019
         </time>
         
diff --git a/2019/04/29/opt-cuda-quantized.html b/2019/04/29/opt-cuda-quantized.html
index 3b401af..1c55a9a 100644
--- a/2019/04/29/opt-cuda-quantized.html
+++ b/2019/04/29/opt-cuda-quantized.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Automating Optimization of Quantized Deep Learning Models on CUDA </h1>
       <p class="post-meta">
-        <time datetime="2019-04-29T12:00:00-04:00" itemprop="datePublished">
+        <time datetime="2019-04-29T09:00:00-07:00" itemprop="datePublished">
           Apr 29, 2019
         </time>
         
diff --git a/2019/05/30/pytorch-frontend.html b/2019/05/30/pytorch-frontend.html
index ad8281b..a4dd9a3 100644
--- a/2019/05/30/pytorch-frontend.html
+++ b/2019/05/30/pytorch-frontend.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Integrating TVM into PyTorch </h1>
       <p class="post-meta">
-        <time datetime="2019-05-30T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2019-05-30T00:00:00-07:00" itemprop="datePublished">
           May 30, 2019
         </time>
         
diff --git a/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html b/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html
index 38bd956..50f01e7 100644
--- a/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html
+++ b/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Compiling Machine Learning to WASM and WebGPU with Apache TVM </h1>
       <p class="post-meta">
-        <time datetime="2020-05-14T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2020-05-14T00:00:00-07:00" itemprop="datePublished">
           May 14, 2020
         </time>
         
diff --git a/2020/06/04/tinyml-how-tvm-is-taming-tiny.html b/2020/06/04/tinyml-how-tvm-is-taming-tiny.html
index bcb1aed..ec640c7 100644
--- a/2020/06/04/tinyml-how-tvm-is-taming-tiny.html
+++ b/2020/06/04/tinyml-how-tvm-is-taming-tiny.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>TinyML - How TVM is Taming Tiny </h1>
       <p class="post-meta">
-        <time datetime="2020-06-04T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2020-06-04T00:00:00-07:00" itemprop="datePublished">
           Jun 4, 2020
         </time>
         
diff --git a/2020/07/14/bert-pytorch-tvm.html b/2020/07/14/bert-pytorch-tvm.html
index a563504..387e219 100644
--- a/2020/07/14/bert-pytorch-tvm.html
+++ b/2020/07/14/bert-pytorch-tvm.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Bridging PyTorch and TVM </h1>
       <p class="post-meta">
-        <time datetime="2020-07-14T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2020-07-14T00:00:00-07:00" itemprop="datePublished">
           Jul 14, 2020
         </time>
         
diff --git a/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html b/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html
index a2066ec..3d39e96 100644
--- a/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html
+++ b/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>How to Bring Your Own Codegen to TVM </h1>
       <p class="post-meta">
-        <time datetime="2020-07-15T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2020-07-15T00:00:00-07:00" itemprop="datePublished">
           Jul 15, 2020
         </time>
         
diff --git a/2020/09/26/bring-your-own-datatypes.html b/2020/09/26/bring-your-own-datatypes.html
index 0dc4fb0..135d0db 100644
--- a/2020/09/26/bring-your-own-datatypes.html
+++ b/2020/09/26/bring-your-own-datatypes.html
@@ -140,7 +140,7 @@
     <div class="span14 w-100">
       <h1>Bring Your Own Datatypes: Enabling Custom Datatype Exploration in TVM </h1>
       <p class="post-meta">
-        <time datetime="2020-09-26T00:00:00-04:00" itemprop="datePublished">
+        <time datetime="2020-09-26T00:00:00-07:00" itemprop="datePublished">
           Sep 26, 2020
         </time>
         
diff --git a/2021/03/03/intro-auto-scheduler.html b/2021/03/03/intro-auto-scheduler.html
new file mode 100644
index 0000000..e10a971
--- /dev/null
+++ b/2021/03/03/intro-auto-scheduler.html
@@ -0,0 +1,321 @@
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>Introducing TVM Auto-scheduler (a.k.a. Ansor)</title>
+    <link rel="shortcut icon" href="/assets/images/favicon.ico">
+    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.3/css/bootstrap.min.css" integrity="sha384-MCw98/SFnGE8fJT3GXwEOngsV7Zt27NXFoaoApmYm81iuXoPkFOJwJ8ERdknLPMO" crossorigin="anonymous">
+    <link rel="stylesheet" href="/css/slick.css">
+    <link rel="stylesheet" href="/css/slick-theme.css">
+    <link rel="stylesheet" href="/css/custom.css">
+</head>
+<body>
+
+    
+<div class="bannerPage">
+      <header class="header">
+      <div class="container">
+        <div class="headerInner d-flex justify-content-between align-items-center">
+          <div class="headerLogo">
+            <a href="/"><img src="/assets/images/logo.svg" alt="Logo"></a>
+          </div>
+          <div id="headMenu" class="headerNav">
+            <button type="button" id="closeHeadMenu" class="navCloseBtn"><img src="/assets/images/close-icon.svg"
+                alt="Close"></button>
+                <ul class="nav">
+    
+    <li class="nav-item">
+        <a class="nav-link" href="/community">Community</a>
+    </li>
+    
+    <li class="nav-item">
+        <a class="nav-link" href="/download">Download</a>
+    </li>
+    
+    <li class="nav-item">
+        <a class="nav-link" href="/vta">VTA</a>
+    </li>
+    
+    <li class="nav-item">
+        <a class="nav-link" href="/blog">Blog</a>
+    </li>
+    
+    <li class="nav-item">
+        <a class="nav-link" href="https://tvm.apache.org/docs/">Docs</a>
+    </li>
+    
+    <li class="nav-item">
+        <a class="nav-link" href="https://tvmconf.org/">Conference</a>
+    </li>
+    
+    <li class="nav-item">
+        <a class="nav-link" href="https://github.com/apache/incubator-tvm/">Github</a>
+    </li>
+    
+</ul>
+            <div class="responsiveasfdropdown">
+              <button type="button" class="btn-link">
+                ASF
+              </button>
+              <ul>
+    
+    <li>
+        <a href="https://www.apache.org/">Apache Homepage</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/licenses/">License</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/foundation/sponsorship.html">Sponsorship</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/security/">Security</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/foundation/thanks.html">Thanks</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/events/current-event">Events</a>
+    </li>
+    
+</ul>
+            </div>
+          </div>
+          <div class="responsiveMenuIcon">
+            <button type="button" id="menuBtn" class="btn-menu"><img src="/assets/images/menu-icon.svg"
+                alt="Menu Icon" /></button>
+          </div>
+          <div class="asfDropdown">
+            <div class="dropdown">
+              <button type="button" class="btn-link dropdown-toggle" data-toggle="dropdown" aria-haspopup="true"
+                aria-expanded="false">
+                ASF
+              </button>
+              <div class="dropdown-menu dropdown-menu-right">
+                <ul>
+    
+    <li>
+        <a href="https://www.apache.org/">Apache Homepage</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/licenses/">License</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/foundation/sponsorship.html">Sponsorship</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/security/">Security</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/foundation/thanks.html">Thanks</a>
+    </li>
+    
+    <li>
+        <a href="https://www.apache.org/events/current-event">Events</a>
+    </li>
+    
+</ul>
+              </div>
+            </div>
+          </div>
+        </div>
+      </div>
+    </header>
+
+</div>
+
+
+<div class="container">
+<div class="content">
+  <div class="row">
+    <div class="span14 w-100">
+      <h1>Introducing TVM Auto-scheduler (a.k.a. Ansor) </h1>
+      <p class="post-meta">
+        <time datetime="2021-03-03T00:00:00-08:00" itemprop="datePublished">
+          Mar 3, 2021
+        </time>
+        
+        • <span itemprop="author" itemscope itemtype="http://schema.org/Person">
+          <span itemprop="name">Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu</span>
+        </span>
+        
+      </p>
+      <p class="post-meta">
+        </p>
+    </br>
+    <p>Optimizing the execution speed of deep neural networks is extremely hard with the growing
+model size, operator diversity, and hardware heterogeneity.
+From a computational perspective, deep neural networks are just layers and layers of tensor computations.
+These tensor computations, such as matmul and conv2d, can be easily described by mathematical expressions.
+However, providing high-performance implementations for them on modern hardware can be very challenging.
+We have to apply various low-level optimizations and utilize special hardware intrinsics to achieve high performance.
+It takes huge engineering effort to build linear algebra and neural network acceleration libraries such as cuBLAS, cuDNN, oneMKL, and oneDNN.</p>
+
+<p>Our life will be much easier if we can just write mathematical expressions and have something
+magically turn them into efficient code implementations.
+Three years ago, the deep learning compiler TVM and its search module AutoTVM were built as a first step towards this goal.
+AutoTVM employs a template-based search algorithm to find efficient implementations for a given tensor computation.
+Because it is template-based, however, this approach still requires domain experts to implement a non-trivial manual template
+for every operator on every platform.
+Today, there are more than 15k lines of code for these templates in the TVM code repository.
+Besides being very hard to develop, these templates often have inefficient and limited search spaces,
+making them unable to achieve optimal performance.</p>
+
+<p>To address the limitations of AutoTVM, we started the Ansor project, aiming at a fully automated auto-scheduler for
+generating code for tensor computations.
+The Ansor auto-scheduler takes only tensor expressions as input and generates high-performance code without manual templates.
+We made innovations in the search space construction and search algorithm.
+As a result, the auto-scheduler can achieve better performance with less search time in a more automated way.</p>
+
+<p>The Ansor auto-scheduler is now integrated into Apache TVM as the <code class="language-plaintext highlighter-rouge">tvm.auto_scheduler</code> package.
+This is a joint effort by collaborators from UC Berkeley, Alibaba, AWS, and OctoML.
+Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and Mali GPUs on the TVM website [1].
+In this blog post, we will give a high-level introduction and show some benchmark results.</p>
+
+<h1 id="system-overview">System Overview</h1>
+
+<h2 id="autotvm-vs-auto-scheduler">AutoTVM vs Auto-scheduler</h2>
+<p style="text-align: center"><img src="/images/intro-auto-scheduler/workflow.png" alt="image" width="75%" /></p>
+<center> Table 1. Workflow Comparison </center>
+<p></p>
+
+<p>Table 1 compares the workflow for generating code for an operator in AutoTVM and auto-scheduler.
+In AutoTVM, the developer has to go through three steps.
+In step 1, the developer has to write the compute definition in TVM’s tensor expression language.
+This part is relatively easy because TVM’s tensor expression language looks just like math expressions.
+In step 2, the developer has to write a schedule template, which typically consists of 20-100 lines of tricky DSL code.
+This part requires domain expertise in both the target hardware architecture and operator semantics, so it is difficult.
+The last step, step 3, is automated by a search algorithm.</p>
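+
+<p>As a rough illustration, the following minimal sketch (assuming the <code class="language-plaintext highlighter-rouge">autotvm</code> Python API) shows what steps 1 and 2 look like for a small matmul; real schedule templates are far longer and more intricate than this.</p>
+
+<pre><code class="language-python">
+# Minimal sketch of the AutoTVM workflow: step 1 is the compute definition,
+# step 2 is a hand-written schedule template with tunable knobs.
+from tvm import te, autotvm
+
+@autotvm.template("blog/matmul_sketch")  # hypothetical template name
+def matmul_template(N, M, K):
+    # Step 1: compute definition in the tensor expression language.
+    A = te.placeholder((N, K), name="A")
+    B = te.placeholder((K, M), name="B")
+    k = te.reduce_axis((0, K), name="k")
+    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
+
+    # Step 2: a heavily simplified schedule template with tunable knobs
+    # that the search in step 3 will explore.
+    s = te.create_schedule(C.op)
+    cfg = autotvm.get_config()
+    i, j = C.op.axis
+    cfg.define_split("tile_i", i, num_outputs=2)
+    cfg.define_split("tile_j", j, num_outputs=2)
+    io, ii = cfg["tile_i"].apply(s, C, i)
+    jo, ji = cfg["tile_j"].apply(s, C, j)
+    s[C].reorder(io, jo, k, ii, ji)
+    return s, [A, B, C]
+</code></pre>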
+
+<p>In auto-scheduler, we eliminate the most difficult step 2 by automatic search space construction and accelerate step 3 with a better search algorithm.
+By doing automatic search space construction, we not only eliminate huge manual effort
+but also enable the exploration of many more optimization combinations.
+This automation does not come for free, because we still need to design rules to generate the search space.
+However, these rules are very general, because they are based on static analysis of the tensor expressions.
+We only need to design a few general rules once, and they apply to almost all tensor computations in deep learning.</p>
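+
+<p>As a minimal sketch (assuming the <code class="language-plaintext highlighter-rouge">tvm.auto_scheduler</code> Python API), the same matmul can be tuned with only its compute definition and no schedule template; see the tutorials in [1] for the authoritative version.</p>
+
+<pre><code class="language-python">
+# Minimal sketch: tuning a single operator with tvm.auto_scheduler.
+# Only the compute definition is registered; the search space is
+# constructed automatically, so no schedule template is written.
+import tvm
+from tvm import te, auto_scheduler
+
+@auto_scheduler.register_workload
+def matmul(N, M, K, dtype):
+    A = te.placeholder((N, K), name="A", dtype=dtype)
+    B = te.placeholder((K, M), name="B", dtype=dtype)
+    k = te.reduce_axis((0, K), name="k")
+    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
+    return [A, B, C]
+
+target = tvm.target.Target("llvm")
+task = auto_scheduler.SearchTask(func=matmul, args=(1024, 1024, 1024, "float32"), target=target)
+
+log_file = "matmul_tuning.json"  # hypothetical log file name
+task.tune(auto_scheduler.TuningOptions(
+    num_measure_trials=64,
+    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
+))
+sch, args = task.apply_best(log_file)
+func = tvm.build(sch, args, target)
+</code></pre>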
+
+<h2 id="search-process">Search Process</h2>
+<p style="text-align: center"><img src="/images/intro-auto-scheduler/search_overview.png" alt="image" width="40%" /></p>
+<center> Figure 1. Search Process Overview  </center>
+<p></p>
+
+<p>Figure 1 shows the search process of the auto-scheduler when optimizing a whole neural network.
+The system takes deep learning models as input.
+It then partitions the big model into small subgraphs with Relay’s operator fusion pass.
+A task scheduler is used to allocate the tuning time budget across the many subgraphs.
+At each iteration, it picks a subgraph that has the most potential to increase the end-to-end performance.
+For this subgraph, we analyze its tensor expression and generate several sketches for it.
+Then we run evolutionary search with a learned cost model to get a batch of optimized programs.
+The optimized programs are sent to actual hardware for measurements.
+When the measurements are finished, the profiling results are used as feedback to update all components of the system.
+This process is repeated iteratively until the optimization converges or the time budget runs out.
+More technical details can be found in our paper [3] and our code.</p>
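+
+<p>As a minimal sketch (assuming the Relay and <code class="language-plaintext highlighter-rouge">auto_scheduler</code> Python APIs), the whole-network flow above can be driven roughly as follows: extract subgraph tasks from a Relay model, let the task scheduler split the tuning budget across them, and compile with the best schedules found.</p>
+
+<pre><code class="language-python">
+# Minimal sketch of the whole-network search flow described above.
+import tvm
+from tvm import relay, auto_scheduler
+from tvm.relay import testing
+
+# Any Relay model works here; a small MLP from relay.testing keeps it self-contained.
+mod, params = testing.mlp.get_workload(batch_size=1)
+target = tvm.target.Target("llvm")
+
+# Partition the model into subgraph tuning tasks.
+tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
+
+log_file = "network_tuning.json"  # hypothetical log file name
+tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
+tuner.tune(auto_scheduler.TuningOptions(
+    num_measure_trials=2000,  # shared across all subgraphs by the task scheduler
+    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
+))
+
+# Compile the network with the best schedules found during the search.
+with auto_scheduler.ApplyHistoryBest(log_file):
+    with tvm.transform.PassContext(
+        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
+    ):
+        lib = relay.build(mod, target=target, params=params)
+</code></pre>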
+
+<p>It is worth noting that since the auto-scheduler generates schedules from scratch,
+it reuses the existing computation definitions in TOPI but not the schedule templates.</p>
+
+<h1 id="benchmark-results">Benchmark Results</h1>
+<p>In this section, we benchmark the performance of AutoTVM and the auto-scheduler.
+The CPU benchmark is done on an AWS c5.9xlarge instance, which is equipped with an 18-core Intel Skylake 8124M CPU.
+The GPU benchmark is done on an AWS g4dn.4xlarge instance, which is equipped with an NVIDIA T4 GPU.
+All benchmark code, raw data, and tuning logs can be found in this repo [2].</p>
+
+<h2 id="performance-of-the-generated-code">Performance of the generated code</h2>
+<p>We benchmark the fp32 single-batch inference latency on three networks.
+Figure 2 shows the relative speedup of auto-scheduler against AutoTVM.
+We can see that the auto-scheduler outperforms AutoTVM in all cases, with speedups ranging from 1.02x to 8.95x.
+This is because the auto-scheduler explores a larger search space, which covers more efficient optimization combinations
+that the manual TOPI templates miss.
+BERT-base@GPU is an extreme case where the manual templates are poorly designed;
+in other words, the manual template for dense layers does not perform well for the shapes in the BERT model.</p>
+
+<p style="text-align: center"><img src="/images/intro-auto-scheduler/code_perf.png" alt="image" width="85%" /></p>
+<center> Figure 2. Code Performance Comparison (Higher is better) </center>
+<p></p>
+
+<h2 id="search-time">Search Time</h2>
+<p>Search-based approaches can be very time-consuming, so we also care about the search time.
+It typically takes several hours for the search to converge for a single neural network.
+Figure 3 compares the search time of AutoTVM and auto-scheduler.
+Auto-scheduler requires much less time to converge in most cases, despite its larger search space.
+This is mainly because the auto-scheduler has a better cost model and task scheduler.</p>
+
+<p style="text-align: center"><img src="/images/intro-auto-scheduler/search_time.png" alt="image" width="85%" /></p>
+<center> Figure 3. Search Time Comparison (Lower is better) </center>
+<p></p>
+
+<h2 id="more-results">More Results</h2>
+<p>The repo above serves as an internal benchmark tool for TVM, so it only compares the latest AutoTVM and auto-scheduler.
+You can find results for more libraries and backends in our paper [3].
+Recently, the blog post in [4] also tried the auto-scheduler on an Apple M1 chip and reported good results.</p>
+
+<h1 id="conclusion">Conclusion</h1>
+<p>We built the TVM auto-scheduler, a system that automatically generates high-performance code for tensor expressions.
+Compared with its predecessor AutoTVM, the auto-scheduler does not require manual templates.
+Moreover, it is capable of generating schedules with better performance in less search time.
+We achieve this through innovations in search space construction and the search algorithm.</p>
+
+<p>We are excited about the current performance of auto-scheduler.
+In the future, we are interested in extending its ability to better support
+sparse operators, low-precision operators, and dynamic shapes.</p>
+
+<h1 id="links">Links</h1>
+<p>[1] Tutorials: <a href="https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling">https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling</a><br />
+[2] Benchmark repo: <a href="https://github.com/tlc-pack/TLCBench">https://github.com/tlc-pack/TLCBench</a><br />
+[3] OSDI Paper: <a href="https://arxiv.org/abs/2006.06762">Ansor : Generating High-Performance Tensor Programs for Deep Learning</a><br />
+[4] Results on Apple M1 chip: <a href="https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d">https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d</a>.</p>
+
+
+    </div>
+  </div>
+</div>
+</div>
+
+    
+
+
+
+
+  <script src="https://code.jquery.com/jquery-2.2.0.min.js" type="text/javascript"></script>
+  <script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.3/umd/popper.min.js" integrity="sha384-ZMP7rVo3mIykV+2+9J3UJ46jBk0WLaUAdn689aCwoqbBJiSnjAK/l8WvCWPIPm49" crossorigin="anonymous"></script>
+  <script src="https://stackpath.bootstrapcdn.com/bootstrap/4.1.3/js/bootstrap.min.js" integrity="sha384-ChfqqxuZUCnJSK3+MXmPNIyE6ZbWh2IMqE241rYiqJxyMiZ6OW/JmZQ5stwEULTy" crossorigin="anonymous"></script>
+  <!-- <script src="./assets/js/slick.js"></script> -->
+  <script src="/assets/js/custome.js"></script>
+  <script async src="https://www.googletagmanager.com/gtag/js?id=UA-75982049-2"></script>
+  <script>
+    window.dataLayer = window.dataLayer || [];
+    function gtag(){dataLayer.push(arguments);}
+    gtag('js', new Date());
+    gtag('config', 'UA-75982049-2');
+  </script>
+</body>
+<section class="footerSec">
+  <div class="footerHeader">
+    <ul class="container d-flex align-md-items-center justify-content-between flex-column flex-md-row">
+      <li class="logo">
+
+        <p><a href="/"><img src="/assets/images/logo.svg" alt="logo" title="logo" /></a></p>
+      </li>
+      <li class="copywrite d-flex align-items-center">
+        <h5 id="apache-software-foundation--all-right-reserved">© 2020 Apache Software Foundation | All rights reserved</h5>
+      </li>
+    </ul>
+
+  </div>
+
+  <ul class="container">
+    <li class="footernote">
+      Copyright © 2020 The Apache Software Foundation. Apache TVM, Apache, the Apache feather, and the Apache TVM project logo are either trademarks or registered trademarks of the Apache Software Foundation.</li>
+  </ul>
+
+</section>
+</html>
diff --git a/atom.xml b/atom.xml
index 84cd5f0..cb57f8a 100644
--- a/atom.xml
+++ b/atom.xml
@@ -4,7 +4,7 @@
  <title>TVM</title>
  <link href="https://tvm.apache.org" rel="self"/>
  <link href="https://tvm.apache.org"/>
- <updated>2021-01-04T16:22:52-05:00</updated>
+ <updated>2021-03-03T01:20:46-08:00</updated>
  <id>https://tvm.apache.org</id>
  <author>
    <name></name>
@@ -13,9 +13,139 @@
 
  
  <entry>
+   <title>Introducing TVM Auto-scheduler (a.k.a. Ansor)</title>
+   <link href="https://tvm.apache.org/2021/03/03/intro-auto-scheduler"/>
+   <updated>2021-03-03T00:00:00-08:00</updated>
+   <id>https://tvm.apache.org/2021/03/03/intro-auto-scheduler</id>
+   <content type="html">&lt;p&gt;Optimizing the execution speed of deep neural networks is extremely hard with the growing
+model size, operator diversity, and hardware heterogeneity.
+From a computational perspective, deep neural networks are just layers and layers of tensor computations.
+These tensor computations, such as matmul and conv2d, can be easily described by mathematical expressions.
+However, providing high-performance implementations for them on modern hardware can be very challenging.
+We have to apply various low-level optimizations and utilize special hardware intrinsics to achieve high performance.
+It takes huge engineering effort to build linear algebra and neural network acceleration libraries such as cuBLAS, cuDNN, oneMKL, and oneDNN.&lt;/p&gt;
+
+&lt;p&gt;Our life will be much easier if we can just write mathematical expressions and have something
+magically turn them into efficient code implementations.
+Three years ago, the deep learning compiler TVM and its search module AutoTVM were built as a first step towards this goal.
+AutoTVM employs a template-based search algorithm to find efficient implementations for a given tensor computation.
+Because it is template-based, however, this approach still requires domain experts to implement a non-trivial manual template
+for every operator on every platform.
+Today, there are more than 15k lines of code for these templates in the TVM code repository.
+Besides being very hard to develop, these templates often have inefficient and limited search spaces,
+making them unable to achieve optimal performance.&lt;/p&gt;
+
+&lt;p&gt;To address the limitations of AutoTVM, we started the Ansor project, aiming at a fully automated auto-scheduler for
+generating code for tensor computations.
+The Ansor auto-scheduler takes only tensor expressions as input and generates high-performance code without manual templates.
+We made innovations in the search space construction and search algorithm.
+As a result, the auto-scheduler can achieve better performance with less search time in a more automated way.&lt;/p&gt;
+
+&lt;p&gt;The Ansor auto-scheduler is now integrated into Apache TVM as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tvm.auto_scheduler&lt;/code&gt; package.
+This is a joint effort by collaborators from UC Berkeley, Alibaba, AWS, and OctoML.
+Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and Mali GPUs on the TVM website [1].
+In this blog post, we will give a high-level introduction and show some benchmark results.&lt;/p&gt;
+
+&lt;h1 id=&quot;system-overview&quot;&gt;System Overview&lt;/h1&gt;
+
+&lt;h2 id=&quot;autotvm-vs-auto-scheduler&quot;&gt;AutoTVM vs Auto-scheduler&lt;/h2&gt;
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/workflow.png&quot; alt=&quot;image&quot; width=&quot;75%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Table 1. Workflow Comparison &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;p&gt;Table 1 compares the workflow for generating code for an operator in AutoTVM and auto-scheduler.
+In AutoTVM, the developer has to go through three steps.
+In step 1, the developer has to write the compute definition in TVM’s tensor expression language.
+This part is relatively easy because TVM’s tensor expression language looks just like math expressions.
+In step 2, the developer has to write a schedule template, which typically consists of 20-100 lines of tricky DSL code.
+This part requires domain expertise in both the target hardware architecture and operator semantics, so it is difficult.
+The last step, step 3, is automated by a search algorithm.&lt;/p&gt;
+
+&lt;p&gt;In auto-scheduler, we eliminate the most difficult step 2 by automatic search space construction and accelerate step 3 with a better search algorithm.
+By doing automatic search space construction, we not only eliminate huge manual effort
+but also enable the exploration of many more optimization combinations.
+This automation does not come for free, because we still need to design rules to generate the search space.
+However, these rules are very general, because they are based on static analysis of the tensor expressions.
+We only need to design a few general rules once, and they apply to almost all tensor computations in deep learning.&lt;/p&gt;
+
+&lt;h2 id=&quot;search-process&quot;&gt;Search Process&lt;/h2&gt;
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/search_overview.png&quot; alt=&quot;image&quot; width=&quot;40%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 1. Search Process Overview  &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 1 shows the search process of the auto-scheduler when optimizing a whole neural network.
+The system takes deep learning models as input.
+It then partitions the big model into small subgraphs with Relay’s operator fusion pass.
+A task scheduler is used to allocate the tuning time budget across the many subgraphs.
+At each iteration, it picks a subgraph that has the most potential to increase the end-to-end performance.
+For this subgraph, we analyze its tensor expression and generate several sketches for it.
+Then we run evolutionary search with a learned cost model to get a batch of optimized programs.
+The optimized programs are sent to actual hardware for measurements.
+When the measurements are finished, the profiling results are used as feedback to update all components of the system.
+This process is repeated iteratively until the optimization converges or the time budget runs out.
+More technical details can be found in our paper [3] and our code.&lt;/p&gt;
+
+&lt;p&gt;It is worth noting that since the auto-scheduler generates schedules from scratch,
+it reuses the existing computation definitions in TOPI but not the schedule templates.&lt;/p&gt;
+
+&lt;h1 id=&quot;benchmark-results&quot;&gt;Benchmark Results&lt;/h1&gt;
+&lt;p&gt;In this section, we benchmark the performance of AutoTVM and the auto-scheduler.
+The CPU benchmark is done on an AWS c5.9xlarge instance, which is equipped with an 18-core Intel Skylake 8124M CPU.
+The GPU benchmark is done on an AWS g4dn.4xlarge instance, which is equipped with an NVIDIA T4 GPU.
+All benchmark code, raw data, and tuning logs can be found in this repo [2].&lt;/p&gt;
+
+&lt;h2 id=&quot;performance-of-the-generated-code&quot;&gt;Performance of the generated code&lt;/h2&gt;
+&lt;p&gt;We benchmark the fp32 single-batch inference latency on three networks.
+Figure 2 shows the relative speedup of auto-scheduler against AutoTVM.
+We can see that the auto-scheduler outperforms AutoTVM in all cases, with speedups ranging from 1.02x to 8.95x.
+This is because the auto-scheduler explores a larger search space, which covers more efficient optimization combinations
+that the manual TOPI templates miss.
+BERT-base@GPU is an extreme case where the manual templates are poorly designed;
+in other words, the manual template for dense layers does not perform well for the shapes in the BERT model.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/code_perf.png&quot; alt=&quot;image&quot; width=&quot;85%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 2. Code Performance Comparison (Higher is better) &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;search-time&quot;&gt;Search Time&lt;/h2&gt;
+&lt;p&gt;Search-based approaches can be very time-consuming, so we also care about the search time.
+It typically takes several hours for the search to converge for a single neural network.
+Figure 3 compares the search time of AutoTVM and auto-scheduler.
+Auto-scheduler requires much less time to converge in most cases, despite its larger search space.
+This is mainly because the auto-scheduler has a better cost model and task scheduler.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/search_time.png&quot; alt=&quot;image&quot; width=&quot;85%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 3. Search Time Comparison (Lower is better) &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;more-results&quot;&gt;More Results&lt;/h2&gt;
+&lt;p&gt;The repo above serves as an internal benchmark tool for TVM, so it only compares the latest AutoTVM and auto-scheduler.
+You can find results for more libraries and backends in our paper [3].
+Recently, the blog post in [4] also tried the auto-scheduler on an Apple M1 chip and reported good results.&lt;/p&gt;
+
+&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
+&lt;p&gt;We built the TVM auto-scheduler, a system that automatically generates high-performance code for tensor expressions.
+Compared with its predecessor AutoTVM, the auto-scheduler does not require manual templates.
+Moreover, it is capable of generating schedules with better performance in less search time.
+We achieve this through innovations in search space construction and the search algorithm.&lt;/p&gt;
+
+&lt;p&gt;We are excited about the current performance of auto-scheduler.
+In the future, we are interested in extending its ability to better support
+sparse operators, low-precision operators, and dynamic shapes.&lt;/p&gt;
+
+&lt;h1 id=&quot;links&quot;&gt;Links&lt;/h1&gt;
+&lt;p&gt;[1] Tutorials: &lt;a href=&quot;https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling&quot;&gt;https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling&lt;/a&gt;&lt;br /&gt;
+[2] Benchmark repo: &lt;a href=&quot;https://github.com/tlc-pack/TLCBench&quot;&gt;https://github.com/tlc-pack/TLCBench&lt;/a&gt;&lt;br /&gt;
+[3] OSDI Paper: &lt;a href=&quot;https://arxiv.org/abs/2006.06762&quot;&gt;Ansor : Generating High-Performance Tensor Programs for Deep Learning&lt;/a&gt;&lt;br /&gt;
+[4] Results on Apple M1 chip: &lt;a href=&quot;https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d&quot;&gt;https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d&lt;/a&gt;.&lt;/p&gt;
+
+</content>
+ </entry>
+ 
+ <entry>
    <title>Bring Your Own Datatypes: Enabling Custom Datatype Exploration in TVM</title>
    <link href="https://tvm.apache.org/2020/09/26/bring-your-own-datatypes"/>
-   <updated>2020-09-26T00:00:00-04:00</updated>
+   <updated>2020-09-26T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2020/09/26/bring-your-own-datatypes</id>
    <content type="html">&lt;p&gt;In this post, we describe the Bring Your Own Datatypes framework, which enables the use of custom datatypes within TVM.&lt;/p&gt;
 
@@ -308,7 +438,7 @@ For more documentation about the Bring Your Own Datatypes framework
  <entry>
    <title>How to Bring Your Own Codegen to TVM</title>
    <link href="https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm"/>
-   <updated>2020-07-15T00:00:00-04:00</updated>
+   <updated>2020-07-15T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</id>
    <content type="html">&lt;p&gt;To free data scientists from worrying about the performance when developing a new model, hardware backend providers (e.g., Intel, NVIDIA, ARM, etc) either provide kernel libraries such as cuBLAS or cuDNN with many commonly used deep learning kernels, or provide frameworks such as DNNL or TensorRT with a graph engine to let users describe their models in a certain way to achieve high performance. In addition, emerging deep learning accelerators also have t [...]
 
@@ -787,7 +917,7 @@ Figure 4: After Graph Partitioning.
  <entry>
    <title>Bridging PyTorch and TVM</title>
    <link href="https://tvm.apache.org/2020/07/14/bert-pytorch-tvm"/>
-   <updated>2020-07-14T00:00:00-04:00</updated>
+   <updated>2020-07-14T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2020/07/14/bert-pytorch-tvm</id>
    <content type="html">
 &lt;p&gt;(A more code-heavy variant is crossposted on the more PyTorch affine &lt;a href=&quot;https://lernapparat.de/transformers-pytorch-tvm/&quot;&gt;Lernapparat&lt;/a&gt;,
@@ -1310,7 +1440,7 @@ He is a PyTorch core developer and co-authored &lt;a href=&quot;https://www.mann
  <entry>
    <title>TinyML - How TVM is Taming Tiny</title>
    <link href="https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny"/>
-   <updated>2020-06-04T00:00:00-04:00</updated>
+   <updated>2020-06-04T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</id>
    <content type="html">
 &lt;p&gt;&lt;img src=&quot;/images/microtvm/logo.png&quot; alt=&quot;microTVM logo&quot; width=&quot;30%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
@@ -1619,7 +1749,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication microkernel&lt;/
  <entry>
    <title>Compiling Machine Learning to WASM and WebGPU with Apache TVM</title>
    <link href="https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu"/>
-   <updated>2020-05-14T00:00:00-04:00</updated>
+   <updated>2020-05-14T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu</id>
    <content type="html">&lt;p&gt;&lt;strong&gt;TLDR&lt;/strong&gt;&lt;/p&gt;
 
@@ -1706,7 +1836,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication microkernel&lt;/
  <entry>
    <title>Integrating TVM into PyTorch</title>
    <link href="https://tvm.apache.org/2019/05/30/pytorch-frontend"/>
-   <updated>2019-05-30T00:00:00-04:00</updated>
+   <updated>2019-05-30T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2019/05/30/pytorch-frontend</id>
    <content type="html">&lt;p&gt;As TVM continuously demonstrates improvements to the efficiency of deep learning execution,
 it has become clear that PyTorch stands to benefit from directly leveraging the compiler stack.
@@ -1808,7 +1938,7 @@ relay_graph = torch_tvm.to_relay(mul, inputs)
  <entry>
    <title>Automating Optimization of Quantized Deep Learning Models on CUDA</title>
    <link href="https://tvm.apache.org/2019/04/29/opt-cuda-quantized"/>
-   <updated>2019-04-29T12:00:00-04:00</updated>
+   <updated>2019-04-29T09:00:00-07:00</updated>
    <id>https://tvm.apache.org/2019/04/29/opt-cuda-quantized</id>
    <content type="html">&lt;p&gt;Deep learning has been successfully applied to a variety of tasks.
 On real-time scenarios such as inference on autonomous vehicles, the inference speed of the model is critical.
@@ -1952,7 +2082,7 @@ We show that automatic optimization in TVM makes it easy and flexible to support
  <entry>
    <title>TVM Deep Learning Compiler Joins Apache Software Foundation</title>
    <link href="https://tvm.apache.org/2019/03/18/tvm-apache-announcement"/>
-   <updated>2019-03-18T00:00:00-04:00</updated>
+   <updated>2019-03-18T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2019/03/18/tvm-apache-announcement</id>
    <content type="html">&lt;p&gt;There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms – such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) – requires significant manual effort.&lt;/p&gt;
 
@@ -1975,7 +2105,7 @@ We show that automatic optimization in TVM makes it easy and flexible to support
  <entry>
    <title>TVM Golang Runtime for Deep Learning Deployment</title>
    <link href="https://tvm.apache.org/2019/01/19/Golang"/>
-   <updated>2019-01-19T00:00:00-05:00</updated>
+   <updated>2019-01-19T00:00:00-08:00</updated>
    <id>https://tvm.apache.org/2019/01/19/Golang</id>
    <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
 
@@ -2145,7 +2275,7 @@ closure as TVM packed function and invoke the same across programming language b
  <entry>
    <title>Automating Generation of Low Precision Deep Learning Operators</title>
    <link href="https://tvm.apache.org/2018/12/18/lowprecision-conv"/>
-   <updated>2018-12-18T00:00:00-05:00</updated>
+   <updated>2018-12-18T00:00:00-08:00</updated>
    <id>https://tvm.apache.org/2018/12/18/lowprecision-conv</id>
    <content type="html">&lt;p&gt;As deep learning models grow larger and more complex, deploying them on low powered phone and IoT
 devices becomes challenging because of their limited compute and energy budgets. A  recent  trend
@@ -2306,7 +2436,7 @@ Note: x86 doesn’t support a vectorized popcount for this microarchitecture, so
  <entry>
    <title>Efficient Privacy-Preserving ML Using TVM</title>
    <link href="https://tvm.apache.org/2018/10/09/ml-in-tees"/>
-   <updated>2018-10-09T00:00:00-04:00</updated>
+   <updated>2018-10-09T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2018/10/09/ml-in-tees</id>
    <content type="html">&lt;p&gt;This post describes Myelin, a framework for privacy-preserving machine learning in trusted hardware enclaves, and how TVM makes Myelin fast.
 The key idea is that TVM, unlike other popular ML frameworks, compiles models into lightweight, optimized, and dependency-free libraries which can fit into resource constrained enclaves.&lt;/p&gt;
@@ -2422,7 +2552,7 @@ His research interest is in the general domain of ML on shared private data, but
  <entry>
    <title>Automatic Kernel Optimization for Deep Learning on All Hardware Platforms</title>
    <link href="https://tvm.apache.org/2018/10/03/auto-opt-all"/>
-   <updated>2018-10-03T00:00:00-04:00</updated>
+   <updated>2018-10-03T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2018/10/03/auto-opt-all</id>
    <content type="html">&lt;p&gt;Optimizing the performance of deep neural network on a diverse range of hardware platforms is still a hard
 problem for AI developers. In terms of system support, we are facing a many-to-many problem here:
@@ -2816,7 +2946,7 @@ for inference deployment. TVM just provides such a solution.&lt;/p&gt;
  <entry>
    <title>Building a Cross-Framework Deep Learning Compiler via DLPack</title>
    <link href="https://tvm.apache.org/2018/08/10/DLPack-Bridge"/>
-   <updated>2018-08-10T00:00:00-04:00</updated>
+   <updated>2018-08-10T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2018/08/10/DLPack-Bridge</id>
    <content type="html">&lt;p&gt;Deep learning frameworks such as Tensorflow, PyTorch, and ApacheMxNet provide a
 powerful toolbox for quickly prototyping and deploying deep learning models.
@@ -2955,7 +3085,7 @@ support, and can be used to implement convenient converters, such as
  <entry>
    <title>VTA: An Open, Customizable Deep Learning Acceleration Stack </title>
    <link href="https://tvm.apache.org/2018/07/12/vta-release-announcement"/>
-   <updated>2018-07-12T00:00:00-04:00</updated>
+   <updated>2018-07-12T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2018/07/12/vta-release-announcement</id>
    <content type="html">&lt;p style=&quot;text-align: center&quot;&gt;Thierry Moreau(VTA architect), Tianqi Chen(TVM stack), Ziheng Jiang†(graph compilation), Luis Vega(cloud deployment)&lt;/p&gt;
 &lt;p style=&quot;text-align: center&quot;&gt;Advisors: Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy&lt;/p&gt;
@@ -3097,7 +3227,7 @@ This kind of high-level visibility is essential to system designers who want to
  <entry>
    <title>Bringing TVM into TensorFlow for Optimizing Neural Machine Translation on GPU</title>
    <link href="https://tvm.apache.org/2018/03/23/nmt-transformer-optimize"/>
-   <updated>2018-03-23T00:00:00-04:00</updated>
+   <updated>2018-03-23T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2018/03/23/nmt-transformer-optimize</id>
    <content type="html">&lt;h2 id=&quot;author&quot;&gt;Author&lt;/h2&gt;
 
@@ -3363,7 +3493,7 @@ C = tvm.compute(
  <entry>
    <title>Compiling Deep Learning Models to WebGL with TVM</title>
    <link href="https://tvm.apache.org/2018/03/12/webgl"/>
-   <updated>2018-03-12T00:00:00-04:00</updated>
+   <updated>2018-03-12T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2018/03/12/webgl</id>
    <content type="html">&lt;p&gt;Now TVM comes with a brand-new OpenGL/WebGL backend!
 This blog post explains what it is, and what you can achieve with it.&lt;/p&gt;
@@ -3479,7 +3609,7 @@ optimizations into the TVM stack.&lt;/p&gt;
  <entry>
    <title>Optimizing Mobile Deep Learning on ARM GPU with TVM</title>
    <link href="https://tvm.apache.org/2018/01/16/opt-mali-gpu"/>
-   <updated>2018-01-16T00:00:00-05:00</updated>
+   <updated>2018-01-16T00:00:00-08:00</updated>
    <id>https://tvm.apache.org/2018/01/16/opt-mali-gpu</id>
    <content type="html">&lt;p&gt;With the great success of deep learning, the demand for
 deploying deep neural networks to mobile devices is growing rapidly.
@@ -4053,7 +4183,7 @@ advice and &lt;a href=&quot;https://github.com/yzhliu&quot;&gt;Yizhi Liu&lt;/a&g
  <entry>
    <title>Remote Profile and Test Deep Learning Cross Compilation on Mobile Phones with TVM RPC</title>
    <link href="https://tvm.apache.org/2017/11/08/android-rpc-introduction"/>
-   <updated>2017-11-08T00:00:00-05:00</updated>
+   <updated>2017-11-08T00:00:00-08:00</updated>
    <id>https://tvm.apache.org/2017/11/08/android-rpc-introduction</id>
    <content type="html">&lt;p&gt;TVM stack is an end to end compilation stack to deploy deep learning workloads to all hardware backends.
 Thanks to the NNVM compiler support of TVM stack, we can now directly compile descriptions from deep learning frameworks and compile them to bare metal code.
@@ -4281,7 +4411,7 @@ make jvminstall
  <entry>
    <title>Bringing AMDGPUs to TVM Stack and NNVM Compiler with ROCm</title>
    <link href="https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm"/>
-   <updated>2017-10-30T00:00:00-04:00</updated>
+   <updated>2017-10-30T00:00:00-07:00</updated>
    <id>https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm</id>
    <content type="html">&lt;p style=&quot;text-align: center&quot;&gt;Aditya Atluri, Advanced Micro Devices, Inc.&lt;/p&gt;
 &lt;p style=&quot;text-align: center&quot;&gt;Masahiro Masuda, Ziosoft, Inc.&lt;/p&gt;
@@ -4504,88 +4634,5 @@ BB0_6:
 </content>
  </entry>
  
- <entry>
-   <title>NNVM Compiler: Open Compiler for AI Frameworks</title>
-   <link href="https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement"/>
-   <updated>2017-10-06T11:30:00-04:00</updated>
-   <id>https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement</id>
-   <content type="html">&lt;p style=&quot;text-align: center&quot;&gt;Paul G. Allen School of Computer Science &amp;amp; Engineering, University of Washington&lt;/p&gt;
-&lt;p style=&quot;text-align: center&quot;&gt;Amazon Web Service AI team&lt;/p&gt;
-&lt;p style=&quot;text-align: center&quot;&gt;DMLC open-source community&lt;/p&gt;
-
-&lt;p&gt;Deep learning has become ubiquitous and indispensable. We are seeing a rising need for deploying deep learning workloads on many kinds of platforms such as mobile phones, GPU, IoT devices and specialized accelerators.  Last month, we announced TVM stack to close the gap between deep learning frameworks, and the performance- or efficiency-oriented hardware backends.  TVM stack makes it easy to build an end to end compilation for a deep learning framework.  However, we think it wo [...]
-
-&lt;p&gt;Today, UW Allen school and AWS AI team, together with other contributors, are excited to announce the release of NNVM compiler, an open deep learning compiler to compile front-end framework workloads directly to hardware backends. We build it using the two-level intermediate representation(IR) in the TVM stack.
-The reader is welcome to refer to the &lt;a href=&quot;http://www.tvmlang.org/2017/08/17/tvm-release-announcement.html&quot;&gt;original TVM announcement&lt;/a&gt; for more technical details about TVM stack. With the help of TVM stack, NNVM compiler can:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;Represent and optimize the common deep learning workloads in high level graph IR&lt;/li&gt;
-  &lt;li&gt;Transform the computation graph to minimize memory utilization, optimize data layout and fuse computation patterns for different hardware backends.&lt;/li&gt;
-  &lt;li&gt;Present an end to end compilation pipeline from front-end deep learning frameworks to bare metal hardware.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/nnvm/nnvm_compiler_stack.png&quot; alt=&quot;image&quot; width=&quot;612px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;The NNVM compiler can directly take models from deep learning frameworks such as Apache MXNet.
-It also supports model exchange formats such as ONNX and CoreML. ONNX support enables NNVM to compile deep learning models from PyTorch, Caffe2 and CNTK.
-The CoreML frontend enables deployment of CoreML models to non-iOS devices.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/nnvm/nnvm_compiler_code.png&quot; alt=&quot;image&quot; width=&quot;712px&quot; /&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;separation-of-optimization-and-deployment&quot;&gt;Separation of Optimization and Deployment&lt;/h2&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/nnvm/nnvm_deploy.png&quot; alt=&quot;image&quot; width=&quot;512px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;NNVM compiler applies graph level and tensor level optimizations and jointly optimize them to get the best performance. We take a different approach from existing deep learning frameworks, which packages the graph optimization with the deployment runtime.  NNVM compiler adopts the conventional wisdom from compiler to separate the optimization from the actual deployment runtime. This approach offers substantial optimization but still keeps the runtime lightweight. The compiled mo [...]
-
-&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
-
-&lt;p&gt;NNVM compiler is still under active development, and we can expect more improvements to come, but we have started to see promising results.
-We benchmarked its performance and compared it against Apache MXNet on two typical hardware configurations: ARM CPU on Raspberry Pi and Nvidia GPU on AWS. Despite the radical architecture difference between these two chips, we can use the same infrastructure and only need to change the schedule for each type of hardware.&lt;/p&gt;
-
-&lt;h3 id=&quot;nvidia-gpu&quot;&gt;Nvidia GPU&lt;/h3&gt;
-
-&lt;p&gt;GPU benchmarks and schedules are contributed by Leyuan Wang (AWS/UCDavis) and Yuwei Hu (TuSimple). We compared the NNVM compiler against Apache MXNet with CUDA8 and cuDNN7 as the backend on Nvidia K80. This is a very strong baseline, as Apache MXNet turns on auto-tuning to select the best kernel from CuDNN. We also used the optimized depthwise kernel in MXNet to optimize MobileNet workload.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/nnvm/nnvm_k80_result.png&quot; alt=&quot;image&quot; width=&quot;400px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;As can be seen, NNVM compiler generates code that outperforms Apache MXNet on K80. These improvements are due to the joint graph level and kernel level optimizations. It is worth noting that NNVM compiler generates all the optimized GPU kernels on its own without relying on external libraries like CuDNN.&lt;/p&gt;
-
-&lt;h3 id=&quot;raspberry-pi-3b&quot;&gt;Raspberry Pi 3b&lt;/h3&gt;
-
-&lt;p&gt;The Raspberry Pi compilation stack is contributed by Ziheng Jiang (AWS/FDU).
-We compared NNVM compiler against Apache MXNet with OpenBLAS and NNPack.
-We explored the setups to get the best performance out of MXNet: we turned on Winograd convolution in the NNPACK for 3x3 convolutions, enabled multi-threading and disabled the additional scheduler thread (so all threads are used by NNPack).&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/nnvm/nnvm_rasp_result.png&quot; alt=&quot;image&quot; width=&quot;400px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;As can be seen, the code generated by NNVM compiler is two times faster on ResNet18.
-The gap on MobileNet is mainly due to the lack of depthwise convolution in existing CPU DNN libraries. NNVM compiler benefits from generating efficient ARM code directly.&lt;/p&gt;
-
-&lt;h2 id=&quot;acknowledgement&quot;&gt;Acknowledgement&lt;/h2&gt;
-&lt;p&gt;This project wouldn’t have been possible without our early contributors in the DMLC community.
-We would like to specially thank Yuwei Hu(TuSimple), Leyuan Wang(AWS/UCDavis), Joshua Z. Zhang(AWS)
-and Xingjian Shi(HKUST) for their early contributions to the project. We would also like to thank all the contributors
-to the TVM stack.&lt;/p&gt;
-
-&lt;p&gt;We also learnt a lot from the following projects when building NNVM Compiler.&lt;/p&gt;
-&lt;ul&gt;
-  &lt;li&gt;&lt;a href=&quot;https://github.com/Theano/Theano&quot;&gt;Theano&lt;/a&gt;: possibly the earliest compiler for deep learning&lt;/li&gt;
-  &lt;li&gt;&lt;a href=&quot;https://github.com/halide/Halide&quot;&gt;Halide&lt;/a&gt;: TVM uses &lt;a href=&quot;https://github.com/dmlc/HalideIR&quot;&gt;HalideIR&lt;/a&gt; as data structure for
-arithmetic simplification and low-level lowering. HalideIR is derived from Halide.
-We also learned from Halide when implementing the lowering pipeline in TVM.&lt;/li&gt;
-  &lt;li&gt;&lt;a href=&quot;https://github.com/inducer/loopy&quot;&gt;Loopy&lt;/a&gt;: use of integer set analysis and its loop transformation primitives.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;
-&lt;ul&gt;
-  &lt;li&gt;Github page of NNVM Compiler: &lt;a href=&quot;https://github.com/dmlc/nnvm&quot;&gt;https://github.com/dmlc/nnvm&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;Github page of TVM: &lt;a href=&quot;https://github.com/dmlc/tvm&quot;&gt;https://github.com/dmlc/tvm&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;&lt;a href=&quot;https://news.cs.washington.edu/2017/10/06/allen-school-and-aws-team-up-on-new-nnvm-compiler-for-deep-learning-frameworks/&quot;&gt;UW Allen school blog about NNVM compiler&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;&lt;a href=&quot;https://aws.amazon.com/blogs/ai/introducing-nnvm-compiler-a-new-open-end-to-end-compiler-for-ai-frameworks/&quot;&gt;AWS blogpost about NNVM compiler&lt;/a&gt;&lt;/li&gt;
-&lt;/ul&gt;
-</content>
- </entry>
- 
  
 </feed>
diff --git a/blog.html b/blog.html
index 8dbab07..ae12173 100644
--- a/blog.html
+++ b/blog.html
@@ -146,6 +146,16 @@
 
 <li>
   <span>
+    <a class="post-link" href="/2021/03/03/intro-auto-scheduler">Introducing TVM Auto-scheduler (a.k.a. Ansor)</a>
+  </span>
+  </br>
+  <span>
+    Mar 3, 2021
+  </span>
+</li>
+
+<li>
+  <span>
     <a class="post-link" href="/2020/09/26/bring-your-own-datatypes">Bring Your Own Datatypes: Enabling Custom Datatype Exploration in TVM</a>
   </span>
   </br>
diff --git a/community.html b/community.html
index 365bb79..d3e347f 100644
--- a/community.html
+++ b/community.html
@@ -279,6 +279,10 @@ This is a community maintained list of organizations using and contributing to t
     </li>
     
     <li>
+        <img src="/images/community/sjtu.png" />
+    </li>
+    
+    <li>
         <img src="/images/community/ucberkeley.png" />
     </li>
     
diff --git a/feed.xml b/feed.xml
index a3d90e2..5d387ea 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,124 @@
-<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.1.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2021-01-04T16:22:52-05:00</updated><id>/feed.xml</id><title type="html">TVM</title><author><name>{&quot;name&quot;=&gt;nil}</name></author><entry><title type="html">Bring Your Own Datatypes: Enabling Custom Datatype [...]
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.1.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2021-03-03T01:20:46-08:00</updated><id>/feed.xml</id><title type="html">TVM</title><author><name>{&quot;name&quot;=&gt;nil}</name></author><entry><title type="html">Introducing TVM Auto-scheduler (a.k.a. Ansor)</tit [...]
+model size, operator diversity, and hardware heterogeneity.
+From a computational perspective, deep neural networks are just layers and layers of tensor computations.
+These tensor computations, such as matmul and conv2d, can be easily described by mathematical expressions.
+However, providing high-performance implementations for them on modern hardware can be very challenging.
+We have to apply various low-level optimizations and utilize special hardware intrinsics to achieve high performance.
+It takes huge engineering effort to build linear algebra and neural network acceleration libraries like CuBLAS, CuDNN, oneMKL, and oneDNN.&lt;/p&gt;
+
+&lt;p&gt;Our life will be much easier if we can just write mathematical expressions and have something
+magically turn them into efficient code implementations.
+Three years ago, deep learning compiler TVM and its search module AutoTVM were built as the first step towards this goal.
+AutoTVM employs a template-based search algorithm to find efficient implementations for a given tensor computation.
+However, it is a template-based approach, so it still requires domain experts to implement a non-trivial manual template
+for every operator on every platform.
+Today, there are more than 15k lines of code for these templates in the TVM code repository.
+Besides being very hard to develop, these templates often have inefficient and limited search spaces,
+making them unable to achieve optimal performance.&lt;/p&gt;
+
+&lt;p&gt;To address the limitations of AutoTVM, we started the Ansor project, aiming at a fully automated auto-scheduler for
+generating code for tensor computations.
+The Ansor auto-scheduler takes only tensor expressions as input and generates high-performance code without manual templates.
+We made innovations in both the search space construction and the search algorithm.
+As a result, the auto-scheduler can achieve better performance with less search time in a more automated way.&lt;/p&gt;
+
+&lt;p&gt;Ansor auto-scheduler is now integrated into Apache TVM as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tvm.auto_scheduler&lt;/code&gt; package.
+This is a joint effort by collaborators from UC Berkeley, Alibaba, AWS and OctoML.
+Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and Mali GPUs on the TVM website [1].
+In this blog post, we will give a high-level introduction and show some benchmark results.&lt;/p&gt;
+
+&lt;h1 id=&quot;system-overview&quot;&gt;System Overview&lt;/h1&gt;
+
+&lt;h2 id=&quot;autotvm-vs-auto-scheduler&quot;&gt;AutoTVM vs Auto-scheduler&lt;/h2&gt;
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/workflow.png&quot; alt=&quot;image&quot; width=&quot;75%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Table 1. Workflow Comparison &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;p&gt;Table 1 compares the workflow for generating code for an operator in AutoTVM and auto-scheduler.
+In AutoTVM, the developer has to go through three steps.
+In step 1, the developer has to write the compute definition in TVM’s tensor expression language.
+This part is relatively easy because TVM’s tensor expression language looks just like math expressions.
+In step 2, the developer has to write a schedule template, which typically consists of 20-100 lines of tricky DSL code.
+This part requires domain expertise of both the target hardware architecture and operator semantics, so it is difficult.
+The last step, step 3, is automated by a search algorithm.&lt;/p&gt;
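To make step 2 concrete, below is a minimal sketch of what an AutoTVM-style schedule template for a matmul might look like, loosely following the public AutoTVM tutorials; the workload name, knob names, and exact API are illustrative and may differ across TVM versions.

    from tvm import te, autotvm

    # Step 1: the compute definition. Step 2: the manual tuning knobs
    # (cfg.define_split) and schedule code that AutoTVM requires a
    # developer to design by hand for every operator and platform.
    @autotvm.template("example/matmul")
    def matmul_template(N, L, M, dtype):
        A = te.placeholder((N, L), name="A", dtype=dtype)
        B = te.placeholder((L, M), name="B", dtype=dtype)
        k = te.reduce_axis((0, L), name="k")
        C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

        s = te.create_schedule(C.op)
        y, x = s[C].op.axis
        (k,) = s[C].op.reduce_axis

        cfg = autotvm.get_config()
        cfg.define_split("tile_y", y, num_outputs=2)   # manual knob
        cfg.define_split("tile_x", x, num_outputs=2)   # manual knob
        yo, yi = cfg["tile_y"].apply(s, C, y)
        xo, xi = cfg["tile_x"].apply(s, C, x)
        s[C].reorder(yo, xo, k, yi, xi)
        return s, [A, B, C]

A real template for an operator such as conv2d on a specific backend contains many more knobs (vectorization, unrolling, thread binding), which is where the 20-100 lines of tricky DSL code come from.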
+
+&lt;p&gt;In auto-scheduler, we eliminate the most difficult step 2 by automatic search space construction and accelerate step 3 with a better search algorithm.
+By doing automatic search space construction, we not only eliminate huge manual effort,
+but also enable the exploration of many more combinations of optimizations.
+This automation does not come for free, because we still need to design rules to generate the search space.
+However, these rules are very general. They are based on static analysis of the tensor expressions.
+We only need to design a few general rules once and can apply them to almost all tensor computations in deep learning.&lt;/p&gt;
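In contrast, with the auto-scheduler only the compute definition from step 1 is written by hand; the search space and the schedule are derived automatically. The following is a minimal sketch along the lines of the tvm.auto_scheduler tutorials; the file name and trial count are illustrative and the exact API may vary across TVM versions.

    import tvm
    from tvm import te, auto_scheduler

    @auto_scheduler.register_workload
    def matmul(N, L, M, dtype):
        # Only the compute definition; no schedule template is needed.
        A = te.placeholder((N, L), name="A", dtype=dtype)
        B = te.placeholder((L, M), name="B", dtype=dtype)
        k = te.reduce_axis((0, L), name="k")
        C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
        return [A, B, C]

    target = tvm.target.Target("llvm")
    task = auto_scheduler.SearchTask(
        func=matmul, args=(1024, 1024, 1024, "float32"), target=target
    )

    log_file = "matmul.json"
    task.tune(auto_scheduler.TuningOptions(
        num_measure_trials=200,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    ))
    sch, args = task.apply_best(log_file)  # best schedule found by the search
    func = tvm.build(sch, args, target)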
+
+&lt;h2 id=&quot;search-process&quot;&gt;Search Process&lt;/h2&gt;
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/search_overview.png&quot; alt=&quot;image&quot; width=&quot;40%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 1. Search Process Overview  &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 1 shows the search process of the auto-scheduler when optimizing a whole neural network.
+The system takes deep learning models as input.
+It then partitions the big model into small subgraphs with Relay’s operator fusion pass.
+A task scheduler is used to allocate the tuning time budget across the many subgraphs.
+At each iteration, it picks a subgraph that has the most potential to increase the end-to-end performance.
+For this subgraph, we analyze its tensor expression and generate several sketches for it.
+Then we run evolutionary search with a learned cost model to get a batch of optimized programs.
+The optimized programs are sent to actual hardware for measurements.
+When the measurements are finished, the profiling results are used as feedback to update all components of the system.
+This process is repeated iteratively until the optimization converges or we run out of time budget.
+More technical details can be found in our paper [3] and our code.&lt;/p&gt;
+
+&lt;p&gt;It is worth noting that since the auto-scheduler generates schedules from scratch,
+it reuses the existing computation definitions in TOPI but not the schedule templates.&lt;/p&gt;
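For a whole network, the flow in Figure 1 roughly corresponds to the sketch below, again following the public auto-scheduler tutorials. The tiny Relay model, log file name, and trial count are stand-ins; in practice the module and parameters come from a Relay frontend.

    import tvm
    from tvm import relay, auto_scheduler

    # A tiny stand-in model; in practice mod/params come from a Relay frontend
    # such as relay.frontend.from_onnx or relay.frontend.from_pytorch.
    data = relay.var("data", shape=(1, 3, 224, 224))
    weight = relay.var("weight", shape=(16, 3, 3, 3))
    out = relay.nn.conv2d(data, weight, padding=(1, 1))
    mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))
    params = {}

    target = tvm.target.Target("llvm")
    log_file = "network.json"

    # Partition the model into subgraphs (tuning tasks), weighted by importance.
    tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

    # The task scheduler allocates measurement trials across the subgraphs.
    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
    tuner.tune(auto_scheduler.TuningOptions(
        num_measure_trials=200,  # total trials shared by all subgraphs
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    ))

    # Compile the network with the best schedules found during the search.
    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(
            opt_level=3, config={"relay.backend.use_auto_scheduler": True}
        ):
            lib = relay.build(mod, target=target, params=params)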
+
+&lt;h1 id=&quot;benchmark-results&quot;&gt;Benchmark Results&lt;/h1&gt;
+&lt;p&gt;In this section, we benchmark the performance of AutoTVM and auto-scheduler.
+The CPU benchmark is done on an AWS c5.9xlarge instance, which is equipped with an 18-core Intel Skylake 8124-M CPU.
+The GPU benchmark is done on an AWS g4dn.4xlarge instance, which is equipped with an NVIDIA T4 GPU.
+All benchmark code, raw data, and tuning logs can be found in this repo [2].&lt;/p&gt;
+
+&lt;h2 id=&quot;performance-of-the-generated-code&quot;&gt;Performance of the generated code&lt;/h2&gt;
+&lt;p&gt;We benchmark the fp32 single-batch inference latency on three networks.
+Figure 2 shows the relative speedup of auto-scheduler against AutoTVM.
+We can see auto-scheduler outperforms AutoTVM in all cases with 1.02x to 8.95x speedup.
+This is because auto-scheduler explores a larger search space, which covers more efficient combinations
+of optimizations that are missed by the manual TOPI templates.
+BERT-base@GPU is an extreme case where the manual templates are poorly designed:
+the manual template for dense layers does not perform well for the shapes in the BERT model.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/code_perf.png&quot; alt=&quot;image&quot; width=&quot;85%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 2. Code Performance Comparison (Higher is better) &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;search-time&quot;&gt;Search Time&lt;/h2&gt;
+&lt;p&gt;Search-based approaches can be very time-consuming, so we also care about the search time.
+It typically takes several hours for the search to converge on a single neural network.
+Figure 3 compares the search time of AutoTVM and auto-scheduler.
+Auto-scheduler requires much less time to converge in most cases, despite its larger search space.
+This is mainly because auto-scheduler has a better cost model and task scheduler.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/search_time.png&quot; alt=&quot;image&quot; width=&quot;85%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 3. Search Time Comparison (Lower is better) &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;more-results&quot;&gt;More Results&lt;/h2&gt;
+&lt;p&gt;The repo above serves as an internal benchmark tool for TVM, so it only compares the latest AutoTVM and auto-scheduler.
+You can find results for more libraries and backends in our paper [3].
+Recently, this blog post [4] also tried the auto-scheduler on an Apple M1 chip and got good results.&lt;/p&gt;
+
+&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
+&lt;p&gt;We built TVM auto-scheduler, a system that automatically generates high-performance code for tensor expressions.
+Compared with its predecessor AutoTVM, auto-scheduler does not require manual templates.
+In addition, auto-scheduler can generate schedules with better performance in a shorter time.
+We achieve this by making innovations in the search space construction and the search algorithm.&lt;/p&gt;
+
+&lt;p&gt;We are excited about the current performance of auto-scheduler.
+In the future, we are interested in extending the auto-scheduler to better support
+sparse operators, low-precision operators, and dynamic shapes.&lt;/p&gt;
+
+&lt;h1 id=&quot;links&quot;&gt;Links&lt;/h1&gt;
+&lt;p&gt;[1] Tutorials: &lt;a href=&quot;https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling&quot;&gt;https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling&lt;/a&gt;&lt;br /&gt;
+[2] Benchmark repo: &lt;a href=&quot;https://github.com/tlc-pack/TLCBench&quot;&gt;https://github.com/tlc-pack/TLCBench&lt;/a&gt;&lt;br /&gt;
+[3] OSDI Paper: &lt;a href=&quot;https://arxiv.org/abs/2006.06762&quot;&gt;Ansor : Generating High-Performance Tensor Programs for Deep Learning&lt;/a&gt;&lt;br /&gt;
+[4] Results on Apple M1 chip: &lt;a href=&quot;https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d&quot;&gt;https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d&lt;/a&gt;.&lt;/p&gt;</content><author><name>Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu</name></author><summary type="html">Optimizing the execution speed of deep neural networks i [...]
 
 &lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
 
@@ -282,7 +402,7 @@ For more documentation about the Bring Your Own Datatypes framework
       &lt;p&gt;&lt;a href=&quot;https://posithub.org/docs/BeatingFloatingPoint.pdf&quot; target=&quot;_blank&quot;&gt;Beating Floating Point at its Own Game: Posit Arithmetic&lt;/a&gt; &lt;a href=&quot;#fnref:posit&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
     &lt;/li&gt;
   &lt;/ol&gt;
-&lt;/div&gt;</content><author><name>Gus Smith, Andrew Liu</name></author><summary type="html">In this post, we describe the Bring Your Own Datatypes framework, which enables the use of custom datatypes within TVM.</summary></entry><entry><title type="html">How to Bring Your Own Codegen to TVM</title><link href="/2020/07/15/how-to-bring-your-own-codegen-to-tvm" rel="alternate" type="text/html" title="How to Bring Your Own Codegen to TVM" /><published>2020-07-15T00:00:00-04:00</published>< [...]
+&lt;/div&gt;</content><author><name>Gus Smith, Andrew Liu</name></author><summary type="html">In this post, we describe the Bring Your Own Datatypes framework, which enables the use of custom datatypes within TVM.</summary></entry><entry><title type="html">How to Bring Your Own Codegen to TVM</title><link href="/2020/07/15/how-to-bring-your-own-codegen-to-tvm" rel="alternate" type="text/html" title="How to Bring Your Own Codegen to TVM" /><published>2020-07-15T00:00:00-07:00</published>< [...]
 
 &lt;p&gt;However, users have to learn a new programming interface when they attempt to work on a new kernel library or a device. As a result, the demand for a unified programming interface becomes more and more important to let all users and hardware backend providers stand on the same page.&lt;/p&gt;
 
@@ -751,7 +871,7 @@ Figure 4: After Graph Partitioning.
 
 &lt;h2 id=&quot;acknowledgment&quot;&gt;Acknowledgment&lt;/h2&gt;
 
-&lt;p&gt;We would like to thank our colleague Animesh Jain for valuable discussions in the framework design; Tianqi Chen and Jared Roesch from OctoML for system design discussions and prototyping; Masahiro Masuda from the TVM community to help code review and improve the DNNL integration. We would also like to thank Ramana Radhakrishnan, Matthew Barrett, Manupa Karunaratne, and Luke Hutton from ARM, U.K. for contributing several helpful ideas, related Relay passes, and the Arm Compute Li [...]
+&lt;p&gt;We would like to thank our colleague Animesh Jain for valuable discussions in the framework design; Tianqi Chen and Jared Roesch from OctoML for system design discussions and prototyping; Masahiro Masuda from the TVM community to help code review and improve the DNNL integration. We would also like to thank Ramana Radhakrishnan, Matthew Barrett, Manupa Karunaratne, and Luke Hutton from ARM, U.K. for contributing several helpful ideas, related Relay passes, and the Arm Compute Li [...]
  the Jupyter Notebook to follow along is on &lt;a href=&quot;https://github.com/t-vi/pytorch-tvmisc/tree/master/transformers-pytorch-tvm/&quot;&gt;github&lt;/a&gt;.)&lt;/p&gt;
 
 &lt;p&gt;Some of the most intriguing applications of Artificial Intelligence have been in Natural Language Processing.
@@ -1264,7 +1384,7 @@ one would want to re-do cheap computation, most prominently point-wise computati
 &lt;h1 id=&quot;author&quot;&gt;Author&lt;/h1&gt;
 
 &lt;p&gt;&lt;a href=&quot;https://lernapparat.de/&quot;&gt;Thomas Viehmann&lt;/a&gt; is the founder of &lt;a href=&quot;https://mathinf.eu/&quot;&gt;MathInf GmbH&lt;/a&gt;, Munich, Germany, a boutique training and consultancy firm focusing on Machine Learning and PyTorch.
-He is a PyTorch core developer and co-authored &lt;a href=&quot;https://www.manning.com/books/deep-learning-with-pytorch&quot;&gt;Deep Learning with PyTorch&lt;/a&gt;, which currently available as &lt;a href=&quot;https://pytorch.org/deep-learning-with-pytorch&quot;&gt;free download from the PyTorch website&lt;/a&gt;.&lt;/p&gt;</content><author><name>Thomas Viehmann, MathInf GmbH</name></author><summary type="html"></summary></entry><entry><title type="html">TinyML - How TVM is Taming Ti [...]
+He is a PyTorch core developer and co-authored &lt;a href=&quot;https://www.manning.com/books/deep-learning-with-pytorch&quot;&gt;Deep Learning with PyTorch&lt;/a&gt;, which currently available as &lt;a href=&quot;https://pytorch.org/deep-learning-with-pytorch&quot;&gt;free download from the PyTorch website&lt;/a&gt;.&lt;/p&gt;</content><author><name>Thomas Viehmann, MathInf GmbH</name></author><summary type="html"></summary></entry><entry><title type="html">TinyML - How TVM is Taming Ti [...]
 
 &lt;p&gt;The proliferation of low-cost, AI-powered consumer devices has led to widespread interest in “bare-metal” (low-power, often without an operating system) devices among ML researchers and practitioners.  While it is already possible for experts to run &lt;em&gt;some&lt;/em&gt; models on &lt;em&gt;some&lt;/em&gt; bare-metal devices, optimizing models for diverse sets of devices is challenging, often requiring manually optimized device-specific libraries.  And for those platforms wi [...]
 
@@ -1563,7 +1683,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication microkernel&lt;/
   &lt;li&gt;&lt;a href=&quot;https://homes.cs.washington.edu/~moreau/&quot;&gt;Thierry Moreau&lt;/a&gt;, for mentoring me during my time at OctoML.&lt;/li&gt;
   &lt;li&gt;&lt;a href=&quot;https://homes.cs.washington.edu/~vegaluis/&quot;&gt;Luis Vega&lt;/a&gt;, for teaching me the fundamentals of interacting with microcontrollers.&lt;/li&gt;
   &lt;li&gt;&lt;a href=&quot;https://www.linkedin.com/in/themadrasi/?originalSubdomain=uk&quot;&gt;Ramana Radhakrishnan&lt;/a&gt;, for supplying the Arm hardware used in our experiments and for providing guidance on its usage.&lt;/li&gt;
-&lt;/ul&gt;</content><author><name>Logan Weber and Andrew Reusch, OctoML</name></author><summary type="html"></summary></entry><entry><title type="html">Compiling Machine Learning to WASM and WebGPU with Apache TVM</title><link href="/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu" rel="alternate" type="text/html" title="Compiling Machine Learning to WASM and WebGPU with Apache TVM" /><published>2020-05-14T00:00:00-04:00</published><updated>2020-05-14T00:00:00-04:00</upd [...]
+&lt;/ul&gt;</content><author><name>Logan Weber and Andrew Reusch, OctoML</name></author><summary type="html"></summary></entry><entry><title type="html">Compiling Machine Learning to WASM and WebGPU with Apache TVM</title><link href="/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu" rel="alternate" type="text/html" title="Compiling Machine Learning to WASM and WebGPU with Apache TVM" /><published>2020-05-14T00:00:00-07:00</published><updated>2020-05-14T00:00:00-07:00</upd [...]
 
 &lt;p&gt;We introduced support for WASM and WebGPU to the Apache TVM deep learning compiler. Our experiments shows that  TVM’s WebGPU backend can get &lt;strong&gt;close to native&lt;/strong&gt; &lt;strong&gt;GPU performance&lt;/strong&gt; when deploying models to the web.&lt;/p&gt;
 
@@ -1641,7 +1761,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication microkernel&lt;/
 
 &lt;h2 id=&quot;acknowledgement&quot;&gt;Acknowledgement&lt;/h2&gt;
 
-&lt;p&gt;We would like to thank the emscripten project for providing the WASM compilation infrastructures as well as the JS library support on the web. We would also like to thank the WebGPU community for various helpful discussions. Thanks to Fletcher Haynes for valuable feedbacks to the post.&lt;/p&gt;</content><author><name>Tianqi Chen and Jared Roesch, OctoML</name></author><summary type="html">TLDR</summary></entry><entry><title type="html">Integrating TVM into PyTorch</title><link  [...]
+&lt;p&gt;We would like to thank the emscripten project for providing the WASM compilation infrastructures as well as the JS library support on the web. We would also like to thank the WebGPU community for various helpful discussions. Thanks to Fletcher Haynes for valuable feedbacks to the post.&lt;/p&gt;</content><author><name>Tianqi Chen and Jared Roesch, OctoML</name></author><summary type="html">TLDR</summary></entry><entry><title type="html">Integrating TVM into PyTorch</title><link  [...]
 it has become clear that PyTorch stands to benefit from directly leveraging the compiler stack.
 A major tenet of PyTorch is providing seamless and robust integrations that don’t get in the user’s way.
 To that end, PyTorch now has an official TVM-based backend, &lt;a href=&quot;https://github.com/pytorch/tvm&quot;&gt;torch_tvm&lt;/a&gt;.&lt;/p&gt;
@@ -1733,7 +1853,7 @@ def mul(a, b, c):
 
 # via script
 relay_graph = torch_tvm.to_relay(mul, inputs)
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;</content><author><name>Bram Wasti</name></author><summary type="html">As TVM continuously demonstrates improvements to the efficiency of deep learning execution, it has become clear that PyTorch stands to benefit from directly leveraging the compiler stack. A major tenet of PyTorch is providing seamless and robust integrations that don’t get in the user’s way. To that end, PyTorch now has an official TVM-based backend, torch_tvm.</summary [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;</content><author><name>Bram Wasti</name></author><summary type="html">As TVM continuously demonstrates improvements to the efficiency of deep learning execution, it has become clear that PyTorch stands to benefit from directly leveraging the compiler stack. A major tenet of PyTorch is providing seamless and robust integrations that don’t get in the user’s way. To that end, PyTorch now has an official TVM-based backend, torch_tvm.</summary [...]
 On real-time scenarios such as inference on autonomous vehicles, the inference speed of the model is critical.
 Network quantization is an effective approach to accelerating deep learning models.
 In quantized models, both data and model parameters are represented with low precision data types such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int8&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float16&lt;/code&gt;.
@@ -1868,7 +1988,7 @@ We show that automatic optimization in TVM makes it easy and flexible to support
 &lt;/ul&gt;
 
 &lt;h1 id=&quot;bio--acknowledgement&quot;&gt;Bio &amp;amp; Acknowledgement&lt;/h1&gt;
-&lt;p&gt;&lt;a href=&quot;https://wuwei.io/&quot;&gt;Wuwei Lin&lt;/a&gt; is an undergraduate student at SJTU. He is currently an intern at TuSimple. The author has many thanks to &lt;a href=&quot;https://homes.cs.washington.edu/~tqchen/&quot;&gt;Tianqi Chen&lt;/a&gt; and &lt;a href=&quot;https://homes.cs.washington.edu/~eqy/&quot;&gt;Eddie Yan&lt;/a&gt; for their reviews.&lt;/p&gt;</content><author><name>Wuwei Lin</name></author><summary type="html">Deep learning has been successfully ap [...]
+&lt;p&gt;&lt;a href=&quot;https://wuwei.io/&quot;&gt;Wuwei Lin&lt;/a&gt; is an undergraduate student at SJTU. He is currently an intern at TuSimple. The author has many thanks to &lt;a href=&quot;https://homes.cs.washington.edu/~tqchen/&quot;&gt;Tianqi Chen&lt;/a&gt; and &lt;a href=&quot;https://homes.cs.washington.edu/~eqy/&quot;&gt;Eddie Yan&lt;/a&gt; for their reviews.&lt;/p&gt;</content><author><name>Wuwei Lin</name></author><summary type="html">Deep learning has been successfully ap [...]
 
 &lt;p&gt;TVM is an open source deep learning compiler stack that closes the gap between the productivity-focused deep learning frameworks, and the performance- or efficiency-oriented hardware backends. Today, we are glad to announce that the TVM community has decided to move on to Apache incubator, and becomes an Apache(incubating) project.&lt;/p&gt;
 
@@ -1882,7 +2002,7 @@ We show that automatic optimization in TVM makes it easy and flexible to support
 
 &lt;p&gt;We would like to take this chance to thank the Allen School for supporting the SAMPL team that gave birth to the TVM project. We would also like to thank the Halide project which provided the basis for TVM’s loop-level IR and initial code generation. We would like to thank our Apache incubator mentors for introducing the project to Apache and providing useful guidance. Finally, we would like to thank the TVM community and all of the organizations, as listed above, that supported [...]
 
-&lt;p&gt;See also the &lt;a href=&quot;https://news.cs.washington.edu/2019/03/18/allen-schools-tvm-deep-learning-compiler-framework-transitions-to-apache/&quot;&gt;Allen School news about the transition here&lt;/a&gt;, &lt;a href=&quot;https://sampl.cs.washington.edu/tvmconf/#about-tvmconf&quot;&gt;TVM conference program slides and recordings&lt;/a&gt;, and &lt;a href=&quot;https://tvm.apache.org/docs//contribute/community.html&quot;&gt;our community guideline here&lt;/a&gt;. Follow us o [...]
+&lt;p&gt;See also the &lt;a href=&quot;https://news.cs.washington.edu/2019/03/18/allen-schools-tvm-deep-learning-compiler-framework-transitions-to-apache/&quot;&gt;Allen School news about the transition here&lt;/a&gt;, &lt;a href=&quot;https://sampl.cs.washington.edu/tvmconf/#about-tvmconf&quot;&gt;TVM conference program slides and recordings&lt;/a&gt;, and &lt;a href=&quot;https://tvm.apache.org/docs//contribute/community.html&quot;&gt;our community guideline here&lt;/a&gt;. Follow us o [...]
 
 &lt;p&gt;TVM is an open deep learning compiler stack to compile various deep learning models from different
 frameworks to CPU, GPU or specialized accelerators.  TVM supports model compilation from a wide range
@@ -2043,155 +2163,4 @@ closure as TVM packed function and invoke the same across programming language b
   &lt;li&gt;[5] &lt;a href=&quot;https://blog.learngoprogramming.com/golang-variadic-funcs-how-to-patterns-369408f19085&quot;&gt;Go Variadic Functions&lt;/a&gt;&lt;/li&gt;
   &lt;li&gt;[6] &lt;a href=&quot;https://github.com/jdeng/gomxnet&quot;&gt;CFFI Ref&lt;/a&gt;&lt;/li&gt;
   &lt;li&gt;[7] &lt;a href=&quot;https://golang.org/pkg/runtime/#SetFinalizer&quot;&gt;Go Finalizers&lt;/a&gt;&lt;/li&gt;
-&lt;/ul&gt;</content><author><name>Siva</name></author><summary type="html">Introduction</summary></entry><entry><title type="html">Automating Generation of Low Precision Deep Learning Operators</title><link href="/2018/12/18/lowprecision-conv" rel="alternate" type="text/html" title="Automating Generation of Low Precision Deep Learning Operators" /><published>2018-12-18T00:00:00-05:00</published><updated>2018-12-18T00:00:00-05:00</updated><id>/2018/12/18/lowprecision-conv</id><content ty [...]
-devices becomes challenging because of their limited compute and energy budgets. A  recent  trend
- in  deep  learning  is  the  use  of  extremely  quantized  models  that operate  on  inputs  and
- weights  of  a  few  bits, with networks like XNOR-Net, DoReFa-Net, and HWGQ-Net making steady
-progress improving accuracy.&lt;/p&gt;
-
-&lt;p&gt;An example of a low precision graph snippet is below. The low precision convolution takes in
-quantized data and bitpacks into the proper data layout for an efficient bitserial convolution.
-The output is in a higher precision and traditional deep learning layers such as batch normalization and ReLu are applied to it, before being re-quantized and sent through another low precision operator.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/low-precision/workflow.png&quot; alt=&quot;image&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
-&lt;center&gt; Low precision convolution pipeline.&lt;/center&gt;
-&lt;p&gt;&lt;/p&gt;
-
-&lt;p&gt;Theoretically, low precision operators use fewer operations than
-floating point operators, leading many to believe they can achieve tremendous speedups.
-However, deep  learning frameworks  leverage  decades  of  engineering  work  through  low  level
-BLAS  and LAPACK libraries that are incredibly well optimized, and CPUs include intrinsic
-instructions to accelerate these tasks.  In  practice,  it  is  not  simple  to  develop low-level
-operators such as convolutions  that  are competitive  with  8-bit  quantized  or  even floating
-point operators.
-In  this  post  we  introduce  our  approach to automatically generating optimized
-low  precision  convolutions for  CPUs. We declare our low precision operators so that they compute
-on efficiently stored low precision inputs, and describe a schedule that describes a search space
-of implementation parameters. We rely on AutoTVM to quickly search the space and find optimized
-parameters for the particular convolution, precision, and backend.&lt;/p&gt;
-
-&lt;h2 id=&quot;bitserial-computation-background&quot;&gt;Bitserial Computation Background&lt;/h2&gt;
-
-&lt;p&gt;The  core  of  low  precision  models  is  the bitserial dot product that enables convolution and
-dense operators to be computed using only bitwise operations and popcount.
- Typically, a dot product is computed by element wise multiplication of two vectors followed by
- summing all the elements, like the simple example below. If all the data is binary, the input
- vectors can be packed into a single integer, and the dot product can be computed by bitwise-anding
- the packed inputs and counting the number of 1’s in the result using popcount.
-Note: Depending how the input data is quantized, bitwise-xnor may be used instead of bitwise-and.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/low-precision/binary-dotproduct.png&quot; alt=&quot;image&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
-&lt;center&gt; Binary dot product.&lt;/center&gt;
-&lt;p&gt;&lt;/p&gt;
-
-&lt;p&gt;Arbitrary precision dot products can be computed in this fashion by first separating input data
-into bitplanes. Once in this representation we can compute dotproduct by summing weighted binary
-dot products between the bitplanes of A and B. The number of binary dotproducts grows with the
-product of A and B’s precision, so this method is only practical for very low precision data.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/low-precision/bitserial-dotproduct.png&quot; alt=&quot;image&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
-&lt;center&gt; Bitserial dot product.&lt;/center&gt;
-&lt;p&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;defining-operators-in-tvm&quot;&gt;Defining Operators in TVM&lt;/h2&gt;
-&lt;p&gt;Before the computation, input data needs to be bitpacked so that the bitplanes of the input data
-can be accessed and are packed into a supported datatype such as a uint8 or uint32. We provide
-a flexible bitpacking operator that takes arbitrary size input tensors and returns a bitpacked
-tensor where the user specifies which axis the bitplanes should be.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/low-precision/bitpack.png&quot; alt=&quot;image&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
-&lt;center&gt; Different bitpacked layouts.&lt;/center&gt;
-&lt;p&gt;&lt;/p&gt;
-
-&lt;p&gt;Once in this bitpacked format the low precision  convolution can be computed bitserially.
-For this demo, that data is packed along the input channel and the bitplanes are added to the
-innermost axis, and the data is packed into 32-bit integers. The bitserial convolution is computed
-similar to a normal convolution, but the bitwise-and (&amp;amp;) replaces multiplication, and we use
-popcount to accumulate values in the packed data. The bitplane axes become additional reduction axes
-and compute the binary dot products between different bitplanes of the input and kernel.
-Finally, the output is computed in an unpacked format and in higher precision.&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;Input_bitpacked&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bitpack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acti [...]
-&lt;span class=&quot;n&quot;&gt;Weights_bitpacked&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bitpack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight_bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pack_axis&lt;/span&gt;&lt;span class=&quot;o&quot;& [...]
-&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;in_height&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;in_width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;in_channel_q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span& [...]
-&lt;span class=&quot;n&quot;&gt;kernel_h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt [...]
-
-&lt;span class=&quot;n&quot;&gt;stride_h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stride_w&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stride&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;pad_top&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_down&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_right&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_pad_tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;( [...]
-
-&lt;span class=&quot;c1&quot;&gt;# Computing the output shape
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out_channel&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_filter&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;out_height&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;simplify&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in_height&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_h&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_top&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+ [...]
-&lt;span class=&quot;n&quot;&gt;out_width&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;simplify&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in_width&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_w&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_left&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+& [...]
-&lt;span class=&quot;n&quot;&gt;pad_before&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_top&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt [...]
-&lt;span class=&quot;n&quot;&gt;pad_after&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_down&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_right&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&l [...]
-&lt;span class=&quot;n&quot;&gt;Input_padded&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Input_bitpacked&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_before&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_after&lt;/span&gt;&lt;span class=&quot;p&quot;&g [...]
-
-&lt;span class=&quot;c1&quot;&gt;# Treat the bitplane axes like additional reduction axes
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reduce_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;in_channel_q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&l [...]
-&lt;span class=&quot;n&quot;&gt;ry&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reduce_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;s [...]
-&lt;span class=&quot;n&quot;&gt;rx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reduce_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;s [...]
-&lt;span class=&quot;n&quot;&gt;ib&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reduce_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input_bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt [...]
-&lt;span class=&quot;n&quot;&gt;wb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reduce_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight_bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &l [...]
-
-
-&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out_height&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out_width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; [...]
-             &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;popcount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
-               &lt;span class=&quot;n&quot;&gt;Input_padded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stride_h&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt; [...]
-               &lt;span class=&quot;n&quot;&gt;Weights_bitpacked&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/sp [...]
-               &lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wb&lt;/span&gt;&lt;spa [...]
-
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;In our schedule we apply common optimizations like vectorization and memory tiling to provide better
-memory locality and take advantage of SIMD units. Some of these optimizations such as tiling,
-require parameters that need to be tuned for the specific microarchitecture. We expose these
-parameters as knobs to TVM and use AutoTVM to automatically tune all the parameters simultaneously.&lt;/p&gt;
-
-&lt;p&gt;Finally, we can craft small microkernels to replace the innermost loop(s) of computation and schedule
- them using TVM’s tensorize primitive. Since compilers often produce suboptimal code, people can
- often write short assembly sequences that are more efficient. These microkernels often take advantage
- of new intrinsics that are being introduced to help accelerate deep learning workloads and use
- them in clever ways to improve memory accesses or reduce the number of instructions required.&lt;/p&gt;
-
-&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;
-
-&lt;h3 id=&quot;raspberry-pi&quot;&gt;Raspberry Pi&lt;/h3&gt;
-&lt;p&gt;Convolution speedups on Raspberry Pi 3B compared to 16-bit integer TVM implementation.
-Workload are convolution layers from ResNet18.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/low-precision/rasp-conv.png&quot; alt=&quot;image&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
-&lt;center&gt; Speedup of low precision convolutions on a Raspberry Pi compared to 16-bit TVM implementation.&lt;/center&gt;
-&lt;p&gt;&lt;/p&gt;
-
-&lt;p&gt;2-bit activation, 1-bit weight convolution speedups on Raspberry Pi 3B compared to hand optimized implementation from &lt;a href=&quot;https://arxiv.org/pdf/1712.02427.pdf&quot;&gt;High performance ultra-low-precision convolutions
-on mobile devices.&lt;/a&gt;.
-Workload are convolution layers from ResNet18.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/low-precision/rasp-conv-2.png&quot; alt=&quot;image&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
-&lt;center&gt; Speedup of 2-bit weight 1-bit activation Raspberry Pi convolutions against a hand optimized implementation.&lt;/center&gt;
-&lt;p&gt;&lt;/p&gt;
-
-&lt;h3 id=&quot;x86&quot;&gt;x86&lt;/h3&gt;
-
-&lt;p&gt;Convolution speedups on x86 compared to a 32-bit floating point TVM implementation.
-Note: x86 doesn’t support a vectorized popcount for this microarchitecture, so speedups are lower.&lt;/p&gt;
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/low-precision/x86-conv.png&quot; alt=&quot;image&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
-&lt;center&gt; Speedup of x86 low precision convolutions compared to a 32-bit floating point TVM implementation.&lt;/center&gt;
-&lt;p&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;show-me-the-code&quot;&gt;Show me the code&lt;/h2&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/nn/bitserial_conv2d.py&quot;&gt;TOPI bitserial convolution&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/arm_cpu/bitserial_conv2d.py&quot;&gt;TOPI ARM cpu bitserial convolution&lt;/a&gt;&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;[1] &lt;a href=&quot;https://arxiv.org/abs/1810.11066&quot;&gt;Automating Generation of Low Precision Deep Learning Operators&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;[2] &lt;a href=&quot;https://arxiv.org/abs/1603.05279&quot;&gt;XNOR-Net&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;[3] &lt;a href=&quot;https://arxiv.org/abs/1702.00953&quot;&gt;HWGQ&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;[4] &lt;a href=&quot;https://arxiv.org/abs/1606.06160&quot;&gt;DoReFa&lt;/a&gt;&lt;/li&gt;
-&lt;/ul&gt;</content><author><name>Meghan Cowan</name></author><summary type="html">As deep learning models grow larger and more complex, deploying them on low powered phone and IoT devices becomes challenging because of their limited compute and energy budgets. A recent trend in deep learning is the use of extremely quantized models that operate on inputs and weights of a few bits, with networks like XNOR-Net, DoReFa-Net, and HWGQ-Net making steady progress improving accuracy.</summary> [...]
\ No newline at end of file
+&lt;/ul&gt;</content><author><name>Siva</name></author><summary type="html">Introduction</summary></entry></feed>
\ No newline at end of file
diff --git a/images/community/sjtu.png b/images/community/sjtu.png
new file mode 100644
index 0000000..0de00de
Binary files /dev/null and b/images/community/sjtu.png differ
diff --git a/images/intro-auto-scheduler/code_perf.png b/images/intro-auto-scheduler/code_perf.png
new file mode 100644
index 0000000..d070a6e
Binary files /dev/null and b/images/intro-auto-scheduler/code_perf.png differ
diff --git a/images/intro-auto-scheduler/search_overview.png b/images/intro-auto-scheduler/search_overview.png
new file mode 100644
index 0000000..7b6f56d
Binary files /dev/null and b/images/intro-auto-scheduler/search_overview.png differ
diff --git a/images/intro-auto-scheduler/search_time.png b/images/intro-auto-scheduler/search_time.png
new file mode 100644
index 0000000..4bd700b
Binary files /dev/null and b/images/intro-auto-scheduler/search_time.png differ
diff --git a/images/intro-auto-scheduler/workflow.png b/images/intro-auto-scheduler/workflow.png
new file mode 100644
index 0000000..b2c7b26
Binary files /dev/null and b/images/intro-auto-scheduler/workflow.png differ
diff --git a/rss.xml b/rss.xml
index f2dfac7..2173b21 100644
--- a/rss.xml
+++ b/rss.xml
@@ -5,12 +5,142 @@
         <description>TVM - </description>
         <link>https://tvm.apache.org</link>
         <atom:link href="https://tvm.apache.org" rel="self" type="application/rss+xml" />
-        <lastBuildDate>Mon, 04 Jan 2021 16:22:52 -0500</lastBuildDate>
-        <pubDate>Mon, 04 Jan 2021 16:22:52 -0500</pubDate>
+        <lastBuildDate>Wed, 03 Mar 2021 01:20:46 -0800</lastBuildDate>
+        <pubDate>Wed, 03 Mar 2021 01:20:46 -0800</pubDate>
         <ttl>60</ttl>
 
 
         <item>
+                <title>Introducing TVM Auto-scheduler (a.k.a. Ansor)</title>
+                <description>&lt;p&gt;Optimizing the execution speed of deep neural networks is extremely hard with the growing
+model size, operator diversity, and hardware heterogeneity.
+From a computational perspective, deep neural networks are just layers and layers of tensor computations.
+These tensor computations, such as matmul and conv2d, can be easily described by mathematical expressions.
+However, providing high-performance implementations for them on modern hardware can be very challenging.
+We have to apply various low-level optimizations and utilize special hardware intrinsics to achieve high performance.
+It takes huge engineering effort to build linear algebra and neural network acceleration libraries like CuBLAS, CuDNN, oneMKL, and oneDNN.&lt;/p&gt;
+
+&lt;p&gt;Our life will be much easier if we can just write mathematical expressions and have something
+magically turn them into efficient code implementations.
+Three years ago, deep learning compiler TVM and its search module AutoTVM were built as the first step towards this goal.
+AutoTVM employs a template-based search algorithm to find efficient implementations for a given tensor computation.
+However, it is a template-based approach, so it still requires domain experts to implement a non-trivial manual template
+for every operator on every platform.
+Today, there are more than 15k lines of code for these templates in the TVM code repository.
+Besides being very hard to develop, these templates often have inefficient and limited search spaces,
+making them unable to achieve optimal performance.&lt;/p&gt;
+
+&lt;p&gt;To address the limitations of AutoTVM, we started the Ansor project, aiming at a fully automated auto-scheduler for
+generating code for tensor computations.
+The Ansor auto-scheduler takes only tensor expressions as input and generates high-performance code without manual templates.
+We made innovations in both the search space construction and the search algorithm.
+As a result, the auto-scheduler can achieve better performance with less search time in a more automated way.&lt;/p&gt;
+
+&lt;p&gt;Ansor auto-scheduler is now integrated into Apache TVM as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tvm.auto_scheduler&lt;/code&gt; package.
+This is a joint effort by collaborators from UC Berkeley, Alibaba, AWS, and OctoML.
+Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and Mali GPUs on the TVM website [1].
+In this blog post, we give a high-level introduction and show some benchmark results.&lt;/p&gt;
+
+&lt;h1 id=&quot;system-overview&quot;&gt;System Overview&lt;/h1&gt;
+
+&lt;h2 id=&quot;autotvm-vs-auto-scheduler&quot;&gt;AutoTVM vs Auto-scheduler&lt;/h2&gt;
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/workflow.png&quot; alt=&quot;image&quot; width=&quot;75%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Table 1. Workflow Comparison &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;p&gt;Table 1 compares the workflow for generating code for an operator in AutoTVM and auto-scheduler.
+In AutoTVM, the developer has to go through three steps.
+In step 1, the developer has to write the compute definition in TVM’s tensor expression language.
+This part is relatively easy because TVM’s tensor expression language looks just like math expressions.
+In step 2, the developer has to write a schedule template, which typically consists of 20-100 lines of tricky DSL code.
+This part requires domain expertise in both the target hardware architecture and the operator semantics, so it is the most difficult part.
+The last step, step 3, is automated by a search algorithm.&lt;/p&gt;
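+
+&lt;p&gt;As a concrete illustration, a compute definition in step 1 looks roughly like the sketch below,
+adapted from the auto-scheduler tutorials in [1] (the function name and shapes here are only illustrative).
+It describes a matmul followed by a bias add purely as tensor expressions, with no scheduling decisions:&lt;/p&gt;
+
+&lt;pre&gt;&lt;code&gt;import tvm
+from tvm import te
+
+# Describe the computation (out = A x B + C) with tensor expressions only.
+def matmul_add(N, L, M, dtype=&quot;float32&quot;):
+    A = te.placeholder((N, L), name=&quot;A&quot;, dtype=dtype)
+    B = te.placeholder((L, M), name=&quot;B&quot;, dtype=dtype)
+    C = te.placeholder((N, M), name=&quot;C&quot;, dtype=dtype)
+    k = te.reduce_axis((0, L), name=&quot;k&quot;)
+    matmul = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name=&quot;matmul&quot;)
+    out = te.compute((N, M), lambda i, j: matmul[i, j] + C[i, j], name=&quot;out&quot;)
+    return [A, B, C, out]
+&lt;/code&gt;&lt;/pre&gt;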
+
+&lt;p&gt;In auto-scheduler, we eliminate the most difficult step 2 by automatic search space construction and accelerate step 3 with a better search algorithm.
+By constructing the search space automatically, we not only eliminate a huge amount of manual effort,
+but also enable the exploration of many more optimization combinations.
+This automation does not come for free, because we still need to design rules to generate the search space.
+However, these rules are very general. They are based on static analysis of the tensor expressions.
+We only need to design a few general rules once and can apply them to almost all tensor computations in deep learning.&lt;/p&gt;
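+
+&lt;p&gt;From the user’s perspective, tuning the compute definition above with auto-scheduler then reduces to a few lines.
+The following is a minimal sketch based on the tutorials in [1]; the exact API may differ slightly across TVM versions,
+and the trial count and log file name are placeholders:&lt;/p&gt;
+
+&lt;pre&gt;&lt;code&gt;from tvm import auto_scheduler
+
+# Register the compute definition sketched above so the auto-scheduler can look it up,
+# then create a search task for a concrete workload (shapes and dtype).
+matmul_add = auto_scheduler.register_workload(matmul_add)
+task = auto_scheduler.SearchTask(
+    func=matmul_add, args=(1024, 1024, 1024, &quot;float32&quot;), target=tvm.target.Target(&quot;llvm&quot;)
+)
+
+# No schedule template is written; we only set a measurement budget and a log file.
+log_file = &quot;matmul.json&quot;
+task.tune(auto_scheduler.TuningOptions(
+    num_measure_trials=1000,
+    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
+))
+
+# Apply the best schedule found and build as usual.
+sch, args = task.apply_best(log_file)
+func = tvm.build(sch, args, target=&quot;llvm&quot;)
+&lt;/code&gt;&lt;/pre&gt;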
+
+&lt;h2 id=&quot;search-process&quot;&gt;Search Process&lt;/h2&gt;
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/search_overview.png&quot; alt=&quot;image&quot; width=&quot;40%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 1. Search Process Overview  &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 1 shows the search process of auto-scheduler when optimizing a whole neural network.
+The system takes deep learning models as input.
+It then partitions the big model into small subgraphs with Relay’s operator fusion pass.
+A task scheduler is used to allocate the time budget across the many subgraphs to be optimized.
+At each iteration, it picks the subgraph with the most potential to improve end-to-end performance.
+For this subgraph, we analyze its tensor expression and generate several sketches for it.
+Then we run evolutionary search with a learned cost model to get a batch of optimized programs.
+The optimized programs are sent to the actual hardware for measurement.
+When the measurements are finished, the profiling results are used as feedback to update all components of the system.
+This process is repeated iteratively until the optimization converges or the time budget runs out.
+More technical details can be found in our paper [3] and our code.&lt;/p&gt;
+
+&lt;p&gt;It is worth noting that since the auto-scheduler generates schedules from scratch,
+it reuses the existing computation definitions in TOPI but not the schedule templates.&lt;/p&gt;
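+
+&lt;p&gt;For whole networks, the flow in Figure 1 is driven by a similarly small API surface.
+Below is a minimal sketch, again following the tutorials in [1]; the Relay module &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mod&lt;/code&gt; and its
+&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;params&lt;/code&gt;, the target string, the trial count, and the log file name are all placeholders:&lt;/p&gt;
+
+&lt;pre&gt;&lt;code&gt;import tvm
+from tvm import relay, auto_scheduler
+
+# mod and params are assumed to come from a Relay frontend importer
+# (e.g. relay.frontend.from_onnx); they are placeholders here.
+target = tvm.target.Target(&quot;llvm -mcpu=skylake-avx512&quot;)
+tasks, task_weights = auto_scheduler.extract_tasks(mod[&quot;main&quot;], params, target)
+
+# The task scheduler allocates measurement trials across the extracted subgraphs,
+# prioritizing those most likely to improve end-to-end performance.
+tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
+tuner.tune(auto_scheduler.TuningOptions(
+    num_measure_trials=20000,
+    measure_callbacks=[auto_scheduler.RecordToFile(&quot;network.json&quot;)],
+))
+
+# Compile the network with the best schedules found during the search.
+with auto_scheduler.ApplyHistoryBest(&quot;network.json&quot;):
+    with tvm.transform.PassContext(
+        opt_level=3, config={&quot;relay.backend.use_auto_scheduler&quot;: True}
+    ):
+        lib = relay.build(mod, target=target, params=params)
+&lt;/code&gt;&lt;/pre&gt;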
+
+&lt;h1 id=&quot;benchmark-results&quot;&gt;Benchmark Results&lt;/h1&gt;
+&lt;p&gt;In this section, we benchmark the performance of AutoTVM and Auto-scheduler.
+The CPU benchmark is done on an AWS c5.9xlarge instance, which is equipped with an 18-core Intel Skylake 8124M CPU.
+The GPU benchmark is done on an AWS g4dn.4xlarge instance, which is equipped with an NVIDIA T4 GPU.
+All benchmark code, raw data, and tuning logs can be found in this repo [2].&lt;/p&gt;
+
+&lt;h2 id=&quot;performance-of-the-generated-code&quot;&gt;Performance of the generated code&lt;/h2&gt;
+&lt;p&gt;We benchmark the fp32 single-batch inference latency on three networks.
+Figure 2 shows the relative speedup of auto-scheduler against AutoTVM.
+We can see that auto-scheduler outperforms AutoTVM in all cases, with speedups ranging from 1.02x to 8.95x.
+This is because auto-scheduler explores a larger search space, which covers more efficient combinations
+of optimizations that are missed by the manual templates in TOPI.
+BERT-base@GPU is an extreme case where the manual templates perform particularly poorly:
+the manual template for dense layers does not work well for the shapes in the BERT model.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/code_perf.png&quot; alt=&quot;image&quot; width=&quot;85%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 2. Code Performance Comparison (Higher is better) &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;search-time&quot;&gt;Search Time&lt;/h2&gt;
+&lt;p&gt;Search-based approaches can be very time-consuming, so we also care about the search time.
+It typically takes several hours for the search to converge on a single neural network.
+Figure 3 compares the search time of AutoTVM and auto-scheduler.
+Auto-scheduler requires much less time to converge in most cases, despite its larger search space.
+This is mainly because auto-scheduler has a better cost model and task scheduler.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/search_time.png&quot; alt=&quot;image&quot; width=&quot;85%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt; Figure 3. Search Time Comparison (Lower is better) &lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;more-results&quot;&gt;More Results&lt;/h2&gt;
+&lt;p&gt;The repo above serves as an internal benchmark tool for TVM, so it only compares the latest AutoTVM and auto-scheduler.
+You can find results for more libraries and backends in our paper [3].
+Recently, this blog post [4] also tried auto-scheduler on an Apple M1 chip and got some good results.&lt;/p&gt;
+
+&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
+&lt;p&gt;We built TVM auto-scheduler, a system that automatically generates high-performance code for tensor expressions.
+Compared with its predecessor AutoTVM, auto-scheduler does not require manual templates.
+Moreover, auto-scheduler is capable of generating schedules with better performance in a shorter search time.
+We achieve this by making innovations in the search space construction and search algorithm.&lt;/p&gt;
+
+&lt;p&gt;We are excited about the current performance of auto-scheduler.
+In the future, we are interested in extending the ability of auto-scheduler to better support
+sparse operators, low-precision operators, and dynamic shapes.&lt;/p&gt;
+
+&lt;h1 id=&quot;links&quot;&gt;Links&lt;/h1&gt;
+&lt;p&gt;[1] Tutorials: &lt;a href=&quot;https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling&quot;&gt;https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling&lt;/a&gt;&lt;br /&gt;
+[2] Benchmark repo: &lt;a href=&quot;https://github.com/tlc-pack/TLCBench&quot;&gt;https://github.com/tlc-pack/TLCBench&lt;/a&gt;&lt;br /&gt;
+[3] OSDI Paper: &lt;a href=&quot;https://arxiv.org/abs/2006.06762&quot;&gt;Ansor : Generating High-Performance Tensor Programs for Deep Learning&lt;/a&gt;&lt;br /&gt;
+[4] Results on Apple M1 chip: &lt;a href=&quot;https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d&quot;&gt;https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d&lt;/a&gt;.&lt;/p&gt;
+
+</description>
+                <link>https://tvm.apache.org/2021/03/03/intro-auto-scheduler</link>
+                <guid>https://tvm.apache.org/2021/03/03/intro-auto-scheduler</guid>
+                <pubDate>Wed, 03 Mar 2021 00:00:00 -0800</pubDate>
+        </item>
+
+        <item>
                 <title>Bring Your Own Datatypes: Enabling Custom Datatype Exploration in TVM</title>
                 <description>&lt;p&gt;In this post, we describe the Bring Your Own Datatypes framework, which enables the use of custom datatypes within TVM.&lt;/p&gt;
 
@@ -300,7 +430,7 @@ For more documentation about the Bring Your Own Datatypes framework
 </description>
                 <link>https://tvm.apache.org/2020/09/26/bring-your-own-datatypes</link>
                 <guid>https://tvm.apache.org/2020/09/26/bring-your-own-datatypes</guid>
-                <pubDate>Sat, 26 Sep 2020 00:00:00 -0400</pubDate>
+                <pubDate>Sat, 26 Sep 2020 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -779,7 +909,7 @@ Figure 4: After Graph Partitioning.
 </description>
                 <link>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</link>
                 <guid>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</guid>
-                <pubDate>Wed, 15 Jul 2020 00:00:00 -0400</pubDate>
+                <pubDate>Wed, 15 Jul 2020 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -1302,7 +1432,7 @@ He is a PyTorch core developer and co-authored &lt;a href=&quot;https://www.mann
 </description>
                 <link>https://tvm.apache.org/2020/07/14/bert-pytorch-tvm</link>
                 <guid>https://tvm.apache.org/2020/07/14/bert-pytorch-tvm</guid>
-                <pubDate>Tue, 14 Jul 2020 00:00:00 -0400</pubDate>
+                <pubDate>Tue, 14 Jul 2020 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -1611,7 +1741,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication microkernel&lt;/
 </description>
                 <link>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</link>
                 <guid>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</guid>
-                <pubDate>Thu, 04 Jun 2020 00:00:00 -0400</pubDate>
+                <pubDate>Thu, 04 Jun 2020 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -1698,7 +1828,7 @@ Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication microkernel&lt;/
 </description>
                 <link>https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu</link>
                 <guid>https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu</guid>
-                <pubDate>Thu, 14 May 2020 00:00:00 -0400</pubDate>
+                <pubDate>Thu, 14 May 2020 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -1800,7 +1930,7 @@ relay_graph = torch_tvm.to_relay(mul, inputs)
 </description>
                 <link>https://tvm.apache.org/2019/05/30/pytorch-frontend</link>
                 <guid>https://tvm.apache.org/2019/05/30/pytorch-frontend</guid>
-                <pubDate>Thu, 30 May 2019 00:00:00 -0400</pubDate>
+                <pubDate>Thu, 30 May 2019 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -1944,7 +2074,7 @@ We show that automatic optimization in TVM makes it easy and flexible to support
 </description>
                 <link>https://tvm.apache.org/2019/04/29/opt-cuda-quantized</link>
                 <guid>https://tvm.apache.org/2019/04/29/opt-cuda-quantized</guid>
-                <pubDate>Mon, 29 Apr 2019 12:00:00 -0400</pubDate>
+                <pubDate>Mon, 29 Apr 2019 09:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -1967,7 +2097,7 @@ We show that automatic optimization in TVM makes it easy and flexible to support
 </description>
                 <link>https://tvm.apache.org/2019/03/18/tvm-apache-announcement</link>
                 <guid>https://tvm.apache.org/2019/03/18/tvm-apache-announcement</guid>
-                <pubDate>Mon, 18 Mar 2019 00:00:00 -0400</pubDate>
+                <pubDate>Mon, 18 Mar 2019 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -2137,7 +2267,7 @@ closure as TVM packed function and invoke the same across programming language b
 </description>
                 <link>https://tvm.apache.org/2019/01/19/Golang</link>
                 <guid>https://tvm.apache.org/2019/01/19/Golang</guid>
-                <pubDate>Sat, 19 Jan 2019 00:00:00 -0500</pubDate>
+                <pubDate>Sat, 19 Jan 2019 00:00:00 -0800</pubDate>
         </item>
 
         <item>
@@ -2298,7 +2428,7 @@ Note: x86 doesn’t support a vectorized popcount for this microarchitecture, so
 </description>
                 <link>https://tvm.apache.org/2018/12/18/lowprecision-conv</link>
                 <guid>https://tvm.apache.org/2018/12/18/lowprecision-conv</guid>
-                <pubDate>Tue, 18 Dec 2018 00:00:00 -0500</pubDate>
+                <pubDate>Tue, 18 Dec 2018 00:00:00 -0800</pubDate>
         </item>
 
         <item>
@@ -2414,7 +2544,7 @@ His research interest is in the general domain of ML on shared private data, but
 </description>
                 <link>https://tvm.apache.org/2018/10/09/ml-in-tees</link>
                 <guid>https://tvm.apache.org/2018/10/09/ml-in-tees</guid>
-                <pubDate>Tue, 09 Oct 2018 00:00:00 -0400</pubDate>
+                <pubDate>Tue, 09 Oct 2018 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -2808,7 +2938,7 @@ for inference deployment. TVM just provides such a solution.&lt;/p&gt;
 </description>
                 <link>https://tvm.apache.org/2018/10/03/auto-opt-all</link>
                 <guid>https://tvm.apache.org/2018/10/03/auto-opt-all</guid>
-                <pubDate>Wed, 03 Oct 2018 00:00:00 -0400</pubDate>
+                <pubDate>Wed, 03 Oct 2018 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -2947,7 +3077,7 @@ support, and can be used to implement convenient converters, such as
 </description>
                 <link>https://tvm.apache.org/2018/08/10/DLPack-Bridge</link>
                 <guid>https://tvm.apache.org/2018/08/10/DLPack-Bridge</guid>
-                <pubDate>Fri, 10 Aug 2018 00:00:00 -0400</pubDate>
+                <pubDate>Fri, 10 Aug 2018 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -3089,7 +3219,7 @@ This kind of high-level visibility is essential to system designers who want to
 </description>
                 <link>https://tvm.apache.org/2018/07/12/vta-release-announcement</link>
                 <guid>https://tvm.apache.org/2018/07/12/vta-release-announcement</guid>
-                <pubDate>Thu, 12 Jul 2018 00:00:00 -0400</pubDate>
+                <pubDate>Thu, 12 Jul 2018 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -3355,7 +3485,7 @@ C = tvm.compute(
 </description>
                 <link>https://tvm.apache.org/2018/03/23/nmt-transformer-optimize</link>
                 <guid>https://tvm.apache.org/2018/03/23/nmt-transformer-optimize</guid>
-                <pubDate>Fri, 23 Mar 2018 00:00:00 -0400</pubDate>
+                <pubDate>Fri, 23 Mar 2018 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -3471,7 +3601,7 @@ optimizations into the TVM stack.&lt;/p&gt;
 </description>
                 <link>https://tvm.apache.org/2018/03/12/webgl</link>
                 <guid>https://tvm.apache.org/2018/03/12/webgl</guid>
-                <pubDate>Mon, 12 Mar 2018 00:00:00 -0400</pubDate>
+                <pubDate>Mon, 12 Mar 2018 00:00:00 -0700</pubDate>
         </item>
 
         <item>
@@ -4045,7 +4175,7 @@ advice and &lt;a href=&quot;https://github.com/yzhliu&quot;&gt;Yizhi Liu&lt;/a&g
 </description>
                 <link>https://tvm.apache.org/2018/01/16/opt-mali-gpu</link>
                 <guid>https://tvm.apache.org/2018/01/16/opt-mali-gpu</guid>
-                <pubDate>Tue, 16 Jan 2018 00:00:00 -0500</pubDate>
+                <pubDate>Tue, 16 Jan 2018 00:00:00 -0800</pubDate>
         </item>
 
         <item>
@@ -4273,7 +4403,7 @@ make jvminstall
 </description>
                 <link>https://tvm.apache.org/2017/11/08/android-rpc-introduction</link>
                 <guid>https://tvm.apache.org/2017/11/08/android-rpc-introduction</guid>
-                <pubDate>Wed, 08 Nov 2017 00:00:00 -0500</pubDate>
+                <pubDate>Wed, 08 Nov 2017 00:00:00 -0800</pubDate>
         </item>
 
         <item>
@@ -4499,90 +4629,7 @@ BB0_6:
 </description>
                 <link>https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm</link>
                 <guid>https://tvm.apache.org/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm</guid>
-                <pubDate>Mon, 30 Oct 2017 00:00:00 -0400</pubDate>
-        </item>
-
-        <item>
-                <title>NNVM Compiler: Open Compiler for AI Frameworks</title>
-                <description>&lt;p style=&quot;text-align: center&quot;&gt;Paul G. Allen School of Computer Science &amp;amp; Engineering, University of Washington&lt;/p&gt;
-&lt;p style=&quot;text-align: center&quot;&gt;Amazon Web Service AI team&lt;/p&gt;
-&lt;p style=&quot;text-align: center&quot;&gt;DMLC open-source community&lt;/p&gt;
-
-&lt;p&gt;Deep learning has become ubiquitous and indispensable. We are seeing a rising need for deploying deep learning workloads on many kinds of platforms such as mobile phones, GPU, IoT devices and specialized accelerators.  Last month, we announced TVM stack to close the gap between deep learning frameworks, and the performance- or efficiency-oriented hardware backends.  TVM stack makes it easy to build an end to end compilation for a deep learning framework.  However, we think it wo [...]
-
-&lt;p&gt;Today, UW Allen school and AWS AI team, together with other contributors, are excited to announce the release of NNVM compiler, an open deep learning compiler to compile front-end framework workloads directly to hardware backends. We build it using the two-level intermediate representation(IR) in the TVM stack.
-The reader is welcome to refer to the &lt;a href=&quot;http://www.tvmlang.org/2017/08/17/tvm-release-announcement.html&quot;&gt;original TVM announcement&lt;/a&gt; for more technical details about TVM stack. With the help of TVM stack, NNVM compiler can:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;Represent and optimize the common deep learning workloads in high level graph IR&lt;/li&gt;
-  &lt;li&gt;Transform the computation graph to minimize memory utilization, optimize data layout and fuse computation patterns for different hardware backends.&lt;/li&gt;
-  &lt;li&gt;Present an end to end compilation pipeline from front-end deep learning frameworks to bare metal hardwares.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/nnvm/nnvm_compiler_stack.png&quot; alt=&quot;image&quot; width=&quot;612px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;The NNVM compiler can directly take models from deep learning frameworks such as Apache MXNet.
-It also support model exchange formats such as ONNX and CoreML. ONNX support enables NNVM to compile deep learning models from PyTorch, Caffe2 and CNTK.
-The CoreML frontend enables deployment of CoreML models to non-iOS devices.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/nnvm/nnvm_compiler_code.png&quot; alt=&quot;image&quot; width=&quot;712px&quot; /&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;separation-of-optimization-and-deployment&quot;&gt;Separation of Optimization and Deployment&lt;/h2&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/nnvm/nnvm_deploy.png&quot; alt=&quot;image&quot; width=&quot;512px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;NNVM compiler applies graph level and tensor level optimizations and jointly optimize them to get the best performance. We take a different approach from existing deep learning frameworks, which packages the graph optimization with the deployment runtime.  NNVM compiler adopts the conventional wisdom from compiler to separate the optimization from the actual deployment runtime. This approach offers substantial optimization but still keeps the runtime lightweight. The compiled mo [...]
-
-&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
-
-&lt;p&gt;NNVM compiler is still under active development, and we can expect more improvements to come, but we have started to see promising results.
-We benchmarked its performance and compared it against Apache MXNet on two typical hardware configurations: ARM CPU on Raspberry PI and Nvidia GPU on AWS. Despite the radical architecture difference between these two chips, we can use the same infrastructure and only need to change the schedule for each type of hardware.&lt;/p&gt;
-
-&lt;h3 id=&quot;nvidia-gpu&quot;&gt;Nvidia GPU&lt;/h3&gt;
-
-&lt;p&gt;GPU benchmarks and schedules are contributed by Leyuan Wang (AWS/UCDavis) and Yuwei Hu (TuSimple). We compared the NNVM compiler against Apache MXNet with CUDA8 and cuDNN7 as the backend on Nvidia K80. This is a very strong baseline, as Apache MXNet turns on auto-tuning to select the best kernel from CuDNN. We also used the optimized depthwise kernel in MXNet to optimize MobileNet workload.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/nnvm/nnvm_k80_result.png&quot; alt=&quot;image&quot; width=&quot;400px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;As can be seen, NNVM compiler generate code that outperforms Apache MXNet on K80. These improvements are due to the joint graph level and kernel level optimizations. It is worth noting that NNVM compiler generates all the optimized GPU kernels on its own without relying on external libraries like CuDNN.&lt;/p&gt;
-
-&lt;h3 id=&quot;raspberry-pi-3b&quot;&gt;Raspberry Pi 3b&lt;/h3&gt;
-
-&lt;p&gt;The Rasberry Pi compilation stack is contributed by Ziheng Jiang(AWS/FDU).
-We compared NNVM compiler against Apache MXNet with OpenBLAS and NNPack.
-We explored the setups to get the best performance out of MXNet: we turned on Winograd convolution in the NNPACK for 3x3 convolutions, enabled multi-threading and disabled the additional scheduler thread (so all threads are used by NNPack).&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/nnvm/nnvm_rasp_result.png&quot; alt=&quot;image&quot; width=&quot;400px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;As can be seen, the code generated by NNVM compiler is two times faster on ResNet18.
-The gap on MobileNet is mainly due to lack of depthwise convolution in existing CPU DNN libraries. NNVM compiler takes benefit of direct generating efficient ARM code directly.&lt;/p&gt;
-
-&lt;h2 id=&quot;acknowledgement&quot;&gt;Acknowledgement&lt;/h2&gt;
-&lt;p&gt;This project wouldn’t become possible without our early contributors in the DMLC community.
-We would like to specially thank Yuwei Hu(TuSimple), Leyuan Wang(AWS/UCDavis), Joshua Z. Zhang(AWS)
-and Xingjian Shi(HKUST) for their early contributions to the project. We would also like to thank all the contributors
-to the TVM stack.&lt;/p&gt;
-
-&lt;p&gt;We also learnt a lot from the following projects when building NNVM Compiler.&lt;/p&gt;
-&lt;ul&gt;
-  &lt;li&gt;&lt;a href=&quot;https://github.com/Theano/Theano&quot;&gt;Theano&lt;/a&gt;: possibly the earliest compiler for deep learning&lt;/li&gt;
-  &lt;li&gt;&lt;a href=&quot;https://github.com/halide/Halide&quot;&gt;Halide&lt;/a&gt;: TVM uses &lt;a href=&quot;https://github.com/dmlc/HalideIR&quot;&gt;HalideIR&lt;/a&gt; as data structure for
-arithematic simplification and low level lowering. HalideIR is derived from Halide.
-We also learns from Halide when implementing the lowering pipeline in TVM.&lt;/li&gt;
-  &lt;li&gt;&lt;a href=&quot;https://github.com/inducer/loopy&quot;&gt;Loopy&lt;/a&gt;: use of integer set analysis and its loop transformation primitives.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;
-&lt;ul&gt;
-  &lt;li&gt;Github page of NNVM Compiler: &lt;a href=&quot;https://github.com/dmlc/nnvm&quot;&gt;https://github.com/dmlc/nnvm&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;Github page of TVM: &lt;a href=&quot;https://github.com/dmlc/tvm&quot;&gt;https://github.com/dmlc/tvm&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;&lt;a href=&quot;https://news.cs.washington.edu/2017/10/06/allen-school-and-aws-team-up-on-new-nnvm-compiler-for-deep-learning-frameworks/&quot;&gt;UW Allen school blog about NNVM compiler&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;&lt;a href=&quot;https://aws.amazon.com/blogs/ai/introducing-nnvm-compiler-a-new-open-end-to-end-compiler-for-ai-frameworks/&quot;&gt;AWS blogpost about NNVM compiler&lt;/a&gt;&lt;/li&gt;
-&lt;/ul&gt;
-</description>
-                <link>https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement</link>
-                <guid>https://tvm.apache.org/2017/10/06/nnvm-compiler-announcement</guid>
-                <pubDate>Fri, 06 Oct 2017 11:30:00 -0400</pubDate>
+                <pubDate>Mon, 30 Oct 2017 00:00:00 -0700</pubDate>
         </item>
 
 
diff --git a/sitemap.txt b/sitemap.txt
index bfad106..db8795d 100644
--- a/sitemap.txt
+++ b/sitemap.txt
@@ -16,6 +16,7 @@ https://tvm.apache.org/vta
 https://tvm.apache.org/feed.xml
 https://tvm.apache.org/css/custom.css.map
 
+https://tvm.apache.org/2021/03/03/intro-auto-scheduler
 https://tvm.apache.org/2020/09/26/bring-your-own-datatypes
 https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm
 https://tvm.apache.org/2020/07/14/bert-pytorch-tvm