Posted to commits@tvm.apache.org by tq...@apache.org on 2020/07/15 17:12:45 UTC

[incubator-tvm-site] branch asf-site updated: Build at Wed Jul 15 09:54:06 PDT 2020

This is an automated email from the ASF dual-hosted git repository.

tqchen pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-tvm-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 6388243  Build at Wed Jul 15 09:54:06 PDT 2020
6388243 is described below

commit 638824318fd834ee3538ee0bd24b945fca6e125e
Author: tqchen <tq...@octoml.ai>
AuthorDate: Wed Jul 15 09:54:06 2020 -0700

    Build at Wed Jul 15 09:54:06 PDT 2020
---
 ...s-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html |   16 +-
 .../15/how-to-bring-your-own-codegen-to-tvm.html   |  675 ++++++++++++
 atom.xml                                           | 1076 +++++++++----------
 blog.html                                          |   10 +
 images/bring-your-own-codegen/after_annotation.png |  Bin 0 -> 25365 bytes
 .../after_merging_regions.png                      |  Bin 0 -> 26529 bytes
 .../bring-your-own-codegen/after_partitioning.png  |  Bin 0 -> 11452 bytes
 images/bring-your-own-codegen/original_graph.png   |  Bin 0 -> 25462 bytes
 rss.xml                                            | 1078 +++++++++-----------
 sitemap.txt                                        |    1 +
 10 files changed, 1671 insertions(+), 1185 deletions(-)

diff --git a/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html b/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html
index 7d0db87..07f0cb6 100644
--- a/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html
+++ b/2017/10/30/Bringing-AMDGPUs-to-TVM-Stack-and-NNVM-Compiler-with-ROCm.html
@@ -262,13 +262,13 @@ We are starting to look at performance optimization and we expect more improveme
 <p>You should see something like this:</p>
 
 <figure class="highlight"><pre><code class="language-llvm" data-lang="llvm"><span class="c1">; ModuleID = 'myadd__kernel0'</span>
-<span class="err">sour</span><span class="k">c</span><span class="err">e_filename</span> <span class="p">=</span> <span class="s">"myadd__kernel0"</span>
+<span class="err">source_filename</span> <span class="p">=</span> <span class="s">"myadd__kernel0"</span>
 <span class="k">target</span> <span class="k">datalayout</span> <span class="p">=</span> <span class="s">"e-p:32:32-p1:64:64-p2:64:64-p3:32:32-p4:64:64-p5:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64"</span>
 <span class="k">target</span> <span class="k">triple</span> <span class="p">=</span> <span class="s">"amdgcn-amd-amdhsa-hcc"</span>
 
 
 <span class="c1">; Function Attrs: nounwind</span>
-<span class="k">define</span> <span class="k">dllexport</span> <span class="err">amdgpu_ker</span><span class="k">ne</span><span class="err">l</span> <span class="kt">void</span> <span class="vg">@myadd__kernel0</span><span class="p">(</span><span class="kt">float</span> <span class="k">add</span><span class="err">rspa</span><span class="k">c</span><span class="err">e</span><span class="p">(</span><span class="m">1</span><span class="p">)*</span> <span class="k">noalias</span> <span clas [...]
+<span class="k">define</span> <span class="k">dllexport</span> <span class="err">amdgpu_kernel</span> <span class="kt">void</span> <span class="vg">@myadd__kernel0</span><span class="p">(</span><span class="kt">float</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)*</span> <span class="k">noalias</span> <span class="k">nocapture</span><span class="p">,</span> <span class="kt">float</span> <span class="k">addrspace</span><span class= [...]
 <span class="nl">entry:</span>
   <span class="nv">%4</span> <span class="p">=</span> <span class="k">tail</span> <span class="k">call</span> <span class="kt">i32</span> <span class="vg">@llvm.amdgcn.workgroup.id.x</span><span class="p">()</span>
   <span class="nv">%5</span> <span class="p">=</span> <span class="k">tail</span> <span class="k">call</span> <span class="kt">i32</span> <span class="vg">@llvm.amdgcn.workitem.id.x</span><span class="p">()</span>
@@ -288,14 +288,14 @@ We are starting to look at performance optimization and we expect more improveme
   <span class="nv">%10</span> <span class="p">=</span> <span class="k">add</span> <span class="k">nsw</span> <span class="kt">i32</span> <span class="nv">%.pre-phi</span><span class="p">,</span> <span class="nv">%5</span>
   <span class="nv">%11</span> <span class="p">=</span> <span class="k">add</span> <span class="k">nsw</span> <span class="kt">i32</span> <span class="nv">%.pre-phi</span><span class="p">,</span> <span class="nv">%5</span>
   <span class="nv">%12</span> <span class="p">=</span> <span class="k">sext</span> <span class="kt">i32</span> <span class="nv">%11</span> <span class="k">to</span> <span class="kt">i64</span>
-  <span class="nv">%13</span> <span class="p">=</span> <span class="k">getelementptr</span> <span class="k">inbounds</span> <span class="kt">float</span><span class="p">,</span> <span class="kt">float</span> <span class="k">add</span><span class="err">rspa</span><span class="k">c</span><span class="err">e</span><span class="p">(</span><span class="m">1</span><span class="p">)*</span> <span class="nv">%2</span><span class="p">,</span> <span class="kt">i64</span> <span class="nv">%12</span>
-  <span class="nv">%14</span> <span class="p">=</span> <span class="k">load</span> <span class="kt">float</span><span class="p">,</span> <span class="kt">float</span> <span class="k">add</span><span class="err">rspa</span><span class="k">c</span><span class="err">e</span><span class="p">(</span><span class="m">1</span><span class="p">)*</span> <span class="nv">%13</span><span class="p">,</span> <span class="k">align</span> <span class="m">4</span><span class="p">,</span> <span class="nv" [...]
-  <span class="nv">%15</span> <span class="p">=</span> <span class="k">getelementptr</span> <span class="k">inbounds</span> <span class="kt">float</span><span class="p">,</span> <span class="kt">float</span> <span class="k">add</span><span class="err">rspa</span><span class="k">c</span><span class="err">e</span><span class="p">(</span><span class="m">1</span><span class="p">)*</span> <span class="nv">%1</span><span class="p">,</span> <span class="kt">i64</span> <span class="nv">%12</span>
-  <span class="nv">%16</span> <span class="p">=</span> <span class="k">load</span> <span class="kt">float</span><span class="p">,</span> <span class="kt">float</span> <span class="k">add</span><span class="err">rspa</span><span class="k">c</span><span class="err">e</span><span class="p">(</span><span class="m">1</span><span class="p">)*</span> <span class="nv">%15</span><span class="p">,</span> <span class="k">align</span> <span class="m">4</span><span class="p">,</span> <span class="nv" [...]
+  <span class="nv">%13</span> <span class="p">=</span> <span class="k">getelementptr</span> <span class="k">inbounds</span> <span class="kt">float</span><span class="p">,</span> <span class="kt">float</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)*</span> <span class="nv">%2</span><span class="p">,</span> <span class="kt">i64</span> <span class="nv">%12</span>
+  <span class="nv">%14</span> <span class="p">=</span> <span class="k">load</span> <span class="kt">float</span><span class="p">,</span> <span class="kt">float</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)*</span> <span class="nv">%13</span><span class="p">,</span> <span class="k">align</span> <span class="m">4</span><span class="p">,</span> <span class="nv">!tbaa</span> <span class="nv">!2</span>
+  <span class="nv">%15</span> <span class="p">=</span> <span class="k">getelementptr</span> <span class="k">inbounds</span> <span class="kt">float</span><span class="p">,</span> <span class="kt">float</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)*</span> <span class="nv">%1</span><span class="p">,</span> <span class="kt">i64</span> <span class="nv">%12</span>
+  <span class="nv">%16</span> <span class="p">=</span> <span class="k">load</span> <span class="kt">float</span><span class="p">,</span> <span class="kt">float</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)*</span> <span class="nv">%15</span><span class="p">,</span> <span class="k">align</span> <span class="m">4</span><span class="p">,</span> <span class="nv">!tbaa</span> <span class="nv">!6</span>
   <span class="nv">%17</span> <span class="p">=</span> <span class="k">fadd</span> <span class="kt">float</span> <span class="nv">%14</span><span class="p">,</span> <span class="nv">%16</span>
   <span class="nv">%18</span> <span class="p">=</span> <span class="k">sext</span> <span class="kt">i32</span> <span class="nv">%10</span> <span class="k">to</span> <span class="kt">i64</span>
-  <span class="nv">%19</span> <span class="p">=</span> <span class="k">getelementptr</span> <span class="k">inbounds</span> <span class="kt">float</span><span class="p">,</span> <span class="kt">float</span> <span class="k">add</span><span class="err">rspa</span><span class="k">c</span><span class="err">e</span><span class="p">(</span><span class="m">1</span><span class="p">)*</span> <span class="nv">%0</span><span class="p">,</span> <span class="kt">i64</span> <span class="nv">%18</span>
-  <span class="k">store</span> <span class="kt">float</span> <span class="nv">%17</span><span class="p">,</span> <span class="kt">float</span> <span class="k">add</span><span class="err">rspa</span><span class="k">c</span><span class="err">e</span><span class="p">(</span><span class="m">1</span><span class="p">)*</span> <span class="nv">%19</span><span class="p">,</span> <span class="k">align</span> <span class="m">4</span><span class="p">,</span> <span class="nv">!tbaa</span> <span clas [...]
+  <span class="nv">%19</span> <span class="p">=</span> <span class="k">getelementptr</span> <span class="k">inbounds</span> <span class="kt">float</span><span class="p">,</span> <span class="kt">float</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)*</span> <span class="nv">%0</span><span class="p">,</span> <span class="kt">i64</span> <span class="nv">%18</span>
+  <span class="k">store</span> <span class="kt">float</span> <span class="nv">%17</span><span class="p">,</span> <span class="kt">float</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)*</span> <span class="nv">%19</span><span class="p">,</span> <span class="k">align</span> <span class="m">4</span><span class="p">,</span> <span class="nv">!tbaa</span> <span class="nv">!9</span>
   <span class="k">br</span> <span class="kt">label</span> <span class="nv">%if_end</span>
 
 
diff --git a/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html b/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html
new file mode 100644
index 0000000..ca7024e
--- /dev/null
+++ b/2020/07/15/how-to-bring-your-own-codegen-to-tvm.html
@@ -0,0 +1,675 @@
+
+<!DOCTYPE html>
+<html lang="en">
+  <head>
+    <meta charset="utf-8">
+    <title>How to Bring Your Own Codegen to TVM</title>
+    
+    <meta name="author" content="">
+
+    <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
+    <!--[if lt IE 9]>
+      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
+    <![endif]-->
+
+    <!-- Le styles -->
+    <link href="/assets/themes/custom-twitter/css/1.4.0/bootstrap.css" rel="stylesheet">
+    <link href="/assets/themes/custom-twitter/css/style.css?body=1" rel="stylesheet" type="text/css" media="all">
+
+    <!-- Le fav and touch icons -->
+  <!-- Update these with your own images
+    <link rel="shortcut icon" href="images/logo/tvm-logo.png">
+  <link rel="shortcut icon" href="images/logo/tvm-logo.png">
+  -->
+  <link href="/images/logo/tvm-logo-square.png" rel="icon" type="image/png"/>
+  <!-- Global site tag (gtag.js) - Google Analytics -->
+  <script async src="https://www.googletagmanager.com/gtag/js?id=UA-75982049-2"></script>
+  <script>
+    window.dataLayer = window.dataLayer || [];
+    function gtag(){dataLayer.push(arguments);}
+
+    gtag('js', new Date());
+    gtag('config', 'UA-75982049-2');
+  </script>
+
+</head>
+
+  <body>
+    <div class="topbar">
+      <div class="fill">
+        <div class="container">
+          <h2 id="logo-wrap">
+            <a href="/" class="nav">
+              <img src="/images/logo/tvm-logo-small-black.png" width="100px">
+            </a>
+          </h2>
+          <ul class="nav" id="nav-bar">
+            <li><a href="/community">Community</a></li>
+            <li><a href="/download">Download</a></li>
+            <li><a href="/about">About</a></li>
+            <li><a href="/vta">VTA</a></li>
+            <li><a href="/blog">Blog</a></li>
+
+            <li> <a href="https://tvm.apache.org/docs">Docs</a></li>
+            <li> <a href="https://tvmconf.org">TVM Conference</a></li>
+            <li> <a href="https://github.com/apache/incubator-tvm/">Github</a></li>
+            <li> <a href="/asf">ASF</a></li>
+          </ul>
+        </div>
+      </div>
+    </div>
+    
+<div class="container">
+<div class="content">
+  <div class="row">
+    <div class="span14">
+      <h1>How to Bring Your Own Codegen to TVM </h1>
+      <p class="post-meta">
+        <time datetime="2020-07-15T00:00:00-07:00" itemprop="datePublished">
+          Jul 15, 2020
+        </time>
+        
+        • <span itemprop="author" itemscope itemtype="http://schema.org/Person">
+          <span itemprop="name">Zhi Chen and Cody Yu, Amazon Web Services, Inc</span>
+        </span>
+        
+      </p>
+      <br/>
+    <p>To free data scientists from worrying about performance when developing a new model, hardware backend providers (e.g., Intel, NVIDIA, ARM) either provide kernel libraries such as cuBLAS or cuDNN with many commonly used deep learning kernels, or provide frameworks such as DNNL or TensorRT with a graph engine to let users describe their models in a certain way to achieve high performance. In addition, emerging deep learning accelerators also have their own compilers, kernel [...]
+
+<p>However, users have to learn a new programming interface whenever they work with a new kernel library or device. As a result, the demand for a unified programming interface becomes increasingly important, so that all users and hardware backend providers can be on the same page.</p>
+
+<p>To share the programming interface with widely used deep learning frameworks, many hardware device providers have attempted to integrate their device backends into TensorFlow. However, since TensorFlow does not provide an official interface for new backends, you have to hack the TensorFlow source for registration, which involves many source file changes and makes future maintenance difficult.</p>
+
+<p>In this post, we demonstrate how you, as a hardware backend provider, can easily leverage the Bring Your Own Codegen (BYOC) framework to integrate the kernel library/compiler/framework of your hardware device into TVM. The most important advantage of leveraging the BYOC framework is that <strong><em>all related source files of your device are self-contained, so the codegen/runtime of your device is pluggable to the TVM code base.</em></strong> It means that 1) the TVM code base with your [...]
+
+<p>In the rest of this post, we first illustrate a scenario in which you may need TVM with BYOC, followed by an overview of the BYOC compilation and runtime flows. Then, we illustrate, step by step, how to integrate a vendor library or an execution engine into TVM with BYOC, using Intel DNNL (a.k.a. MKL-DNN, OneDNN) as a running example.</p>
+
+<h2 id="bring-an-asic-accelerator-to-tvm">Bring an ASIC Accelerator to TVM</h2>
+
+<p>Let’s first set up a scenario to illustrate why you would want to bring your accelerator to TVM and what features you can expect from the BYOC framework. If you are not sure whether your case is suitable for BYOC, you are welcome to raise a discussion at <a href="https://discuss.tvm.ai">discuss.tvm.ai</a>.</p>
+
+<p>Imagine that you have just built an edge device platform with an ARM CPU and a fantastic accelerator that achieves amazing performance on common image classification models. In other words, your accelerator does well on Conv2D, ReLU, GEMM, and other widely used CNN operators.</p>
+
+<p>Unfortunately, object detection models are becoming increasingly popular as well, and your customers need to run both image classification and object detection models on your platform. Although your accelerator is capable of executing almost all operators in object detection models, one operator (e.g., non-maximum suppression, NMS) is missing.</p>
+
+<h3 id="let-tvm-execute-unsupported-operators">Let TVM execute unsupported operators</h3>
+<p>Since TVM has multiple codegens for different backends, it is easy for the open source community to implement new operators on CPU or GPU in a short time. Ideally, if you integrate the compilation flow of your accelerator to TVM with BYOC, TVM will perform Relay graph partitioning to offload a part of the graph to your accelerator while keeping others on TVM. As a result, you can claim that your platform is capable of running all models without worrying about new operators.</p>
+
+<h3 id="customize-graph-level-optimization">Customize graph-level optimization</h3>
+<p>Your ASIC accelerator must have its own compilation flow. Usually, it falls into one of the following cases:</p>
+
+<p><strong>Generate a graph representation and feed it to a graph engine</strong>:
+You may have your own graph engine that is capable of executing a graph (or a neural network model) on your accelerator. For example, both Intel DNNL and NVIDIA TensorRT use an engine to run a whole graph or model, so that they are able to 1) reduce memory transactions between operators and 2) optimize graph execution with operator fusion.</p>
+
+<p>In order to achieve the above two optimizations, you may need to process the graph at compilation time. For example, Conv2D and bias addition are two separate operators in TVM, but they may be one operator (Conv2D with bias addition capability) on your accelerator. In this case, you may want to optimize the graph by replacing the <code class="highlighter-rouge">conv2d - add</code> graph pattern with a <code class="highlighter-rouge">your_conv2d_with_bias</code> node.</p>
+
+<p>If your compilation flow falls into this case, then we recommend reading the rest of this post but skipping <a href="#bring-dnnl-to-tvm-c-source-codegen">Bring DNNL to TVM: C Source Codegen</a>.</p>
+
+<p><strong>Generate assembly code and compile it to an executable binary</strong>:
+If you do not have an end-to-end execution framework for your platform like the previous case, you may have a compiler that compiles a program to assembly code for your ISA. In order to feed the assembly code to your compiler, you will need a codegen to generate and optimize the assembly code from a Relay graph.</p>
+
+<p>If your compilation flow falls into this case, then we recommend reading the rest of this post but skipping <a href="#bring-dnnl-to-tvm-json-codegenruntime">Bring DNNL to TVM: JSON Codegen/Runtime</a>.</p>
+
+<h2 id="how-byoc-works">How BYOC Works</h2>
+
+<p>We now briefly explain how the BYOC framework works. For more detailed explanations of the underlying framework components and their implementations, please refer to the <a href="https://tvm.apache.org/docs/dev/relay_bring_your_own_codegen.html">developer document</a>. In short, given a Relay graph in Figure 1, the BYOC framework performs the following steps:</p>
+
+<p style="text-align: center"><img src="/images/bring-your-own-codegen/original_graph.png" alt="The original Relay graph" width="50%" /></p>
+<center>
+Figure 1: The Original Relay Graph.
+</center>
+<p></p>
+
+<h3 id="1-graph-annotation">1. Graph Annotation</h3>
+<p>Taking a user-provided Relay graph, our first step is to annotate the nodes in the graph that can potentially be offloaded to your accelerator. You will need to follow <a href="#bring-dnnl-to-tvm-annotation-rules">Bring DNNL to TVM: Annotation Rules</a> to implement a whitelist of supported operators, or a graph pattern list of customized composite operators. An example annotation result is shown in Figure 2.</p>
+
+<p style="text-align: center"><img src="/images/bring-your-own-codegen/after_annotation.png" alt="The Graph with Annotations" width="50%" /></p>
+<center>
+Figure 2: The Graph with Annotations.
+</center>
+<p></p>
+
+<h3 id="2-graph-transformation">2. Graph Transformation</h3>
+<p>The second step is to transform and optimize the graph based on the annotations. Specifically, BYOC performs the following transformations.</p>
+
+<p><strong>2.1: Merge compiler regions</strong>: As can be seen in Figure 2, we now have many “regions” in the graph that can be offloaded to your accelerator, but some of them can actually be merged to reduce data transfer and kernel launching overhead. Accordingly, step 2.1 uses a greedy algorithm to merge as many of those regions as possible while guaranteeing functional correctness. The result is depicted in Figure 3.</p>
+
+<p style="text-align: center"><img src="/images/bring-your-own-codegen/after_merging_regions.png" alt="After Merging Compiler Regions" width="50%" /></p>
+<center>
+Figure 3: After Merging Compiler Regions.
+</center>
+<p></p>
+
+<p><strong>2.2: Partition Graph</strong>: For each region from the previous step, we create a Relay function with an attribute <code class="highlighter-rouge">Compiler</code> to indicate that this Relay function should be entirely offloaded to your accelerator, as shown in Figure 4.</p>
+
+<p style="text-align: center"><img src="/images/bring-your-own-codegen/after_partitioning.png" alt="After Graph Partitioning" width="50%" /></p>
+<center>
+Figure 4: After Graph Partitioning.
+</center>
+<p></p>
+
+<h3 id="3-code-generation">3. Code Generation</h3>
+<p>Now we know which parts of the Relay graph should be offloaded. In this step, we sequentially send every Relay function with <code class="highlighter-rouge">Compiler=your_accelerator</code> to your codegen. Your codegen should compile the Relay function into a form that matches your own compilation flow. It can be either C source code or any text format.</p>
+
+<p>Finally, all compiled functions will be serialized along with other non-offloaded Relay functions to a single <code class="highlighter-rouge">.so</code> file by the TVM <code class="highlighter-rouge">export_library</code> Python API. In other words, the user will get only one <code class="highlighter-rouge">.so</code> file after running this flow.</p>
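+
+<p>As a minimal sketch of this user-facing flow (the module and file names here are assumed for illustration, and the exact <code class="highlighter-rouge">relay.build</code> return value may vary across TVM versions):</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from tvm import relay
+
+# "mod" is a partitioned Relay module produced by the BYOC passes.
+lib = relay.build(mod, target="llvm")
+# All functions, offloaded or not, end up in this single .so file.
+lib.export_library("model_with_dnnl.so")
+</code></pre></div></div>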
+
+<h3 id="4-runtime">4. Runtime</h3>
+<p>You may also need to implement a runtime to initialize your graph engine (if applicable) and execute the compiled functions. During inference, the TVM runtime (i.e., graph runtime or VM) leverages your runtime to invoke the offloaded functions when it encounters the corresponding function call in Figure 4. Your runtime is responsible for launching the compiled function with the given input tensor arrays and filling the results into the output tensor arrays.</p>
+
+<p>In the rest of this post, we use DNNL as an example to demonstrate how to achieve the above workflow using the BYOC framework. Please note that all code and line numbers referenced in this post are based on the TVM repository’s master branch commit <a href="https://github.com/apache/incubator-tvm/tree/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8">8a0249c</a>.</p>
+
+<h2 id="bring-dnnl-to-tvm-annotation-rules">Bring DNNL to TVM: Annotation Rules</h2>
+
+<p>The BYOC framework provides two approaches for you to describe the supported operators and patterns. You can use both of them simultaneously. In this section, we use DNNL as an example to show how to make use of them. The complete implementation is available <a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/python/tvm/relay/op/contrib/dnnl.py">here</a>. Note that we put the annotation rules for your codegen under <code class="highlighter-ro [...]
+
+<h3 id="rules-for-single-operators">Rules for single operators</h3>
+<p>You can intuitively specify which Relay operators are supported by your accelerator with the BYOC API. For example, we use the following code snippet to build a rule saying that our DNNL codegen supports Conv2D:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">tvm</span><span class="o">.</span><span class="n">ir</span><span class="o">.</span><span class="n">register_op_attr</span><span class="p">(</span><span class="s">"nn.conv2d"</span><span class="p">,</span> <span class="s">"target.dnnl"</span><span class="p">)</span>
+<span class="k">def</span> <span class="nf">_dnnl_conv2d_wrapper</span><span class="p">(</span><span class="n">attrs</span><span class="p">,</span> <span class="n">args</span><span class="p">):</span>
+  <span class="k">return</span> <span class="bp">True</span>
+</code></pre></div></div>
+<p>This registers a new attribute <code class="highlighter-rouge">target.dnnl</code> for the Relay <code class="highlighter-rouge">nn.conv2d</code> operator. This way, the BYOC annotation pass can invoke <code class="highlighter-rouge">target.dnnl()</code> for every operator in the graph to check whether it is supported by the DNNL codegen.</p>
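+
+<p>Because the wrapper receives the operator attributes and arguments, a rule does not have to return a constant. As a sketch under assumptions (the backend name and the layout restriction below are hypothetical, not part of the actual DNNL rules), a rule could reject configurations your backend does not support:</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import tvm.ir
+
+@tvm.ir.register_op_attr("nn.conv2d", "target.my_backend")
+def _my_conv2d_wrapper(attrs, args):
+  # Hypothetical rule: only offload NCHW convolutions.
+  return attrs.data_layout == "NCHW"
+</code></pre></div></div>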
+
+<p>On the other hand, it might be tedious to write the above code snippet for every single operator. For the DNNL implementation, we implemented a helper function, <code class="highlighter-rouge">_register_external_op_helper</code>, to make our life easier:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_register_external_op_helper</span><span class="p">(</span><span class="n">op_name</span><span class="p">,</span> <span class="n">supported</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
+    <span class="o">@</span><span class="n">tvm</span><span class="o">.</span><span class="n">ir</span><span class="o">.</span><span class="n">register_op_attr</span><span class="p">(</span><span class="n">op_name</span><span class="p">,</span> <span class="s">"target.dnnl"</span><span class="p">)</span>
+    <span class="k">def</span> <span class="nf">_func_wrapper</span><span class="p">(</span><span class="n">attrs</span><span class="p">,</span> <span class="n">args</span><span class="p">):</span>
+        <span class="k">return</span> <span class="n">supported</span>
+    <span class="k">return</span> <span class="n">_func_wrapper</span>
+
+<span class="n">_register_external_op_helper</span><span class="p">(</span><span class="s">"nn.batch_norm"</span><span class="p">)</span>
+<span class="n">_register_external_op_helper</span><span class="p">(</span><span class="s">"nn.conv2d"</span><span class="p">)</span>
+<span class="n">_register_external_op_helper</span><span class="p">(</span><span class="s">"nn.dense"</span><span class="p">)</span>
+<span class="n">_register_external_op_helper</span><span class="p">(</span><span class="s">"nn.relu"</span><span class="p">)</span>
+<span class="n">_register_external_op_helper</span><span class="p">(</span><span class="s">"add"</span><span class="p">)</span>
+<span class="n">_register_external_op_helper</span><span class="p">(</span><span class="s">"subtract"</span><span class="p">)</span>
+<span class="n">_register_external_op_helper</span><span class="p">(</span><span class="s">"multiply"</span><span class="p">)</span>
+</code></pre></div></div>
+<p>In the above example, we specify a list of operators that can be supported by DNNL codegen.</p>
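+
+<p>The <code class="highlighter-rouge">supported</code> flag of the helper also lets you explicitly mark an operator as unsupported. For instance (a hypothetical rule, not in the actual DNNL list):</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Explicitly keep this operator on the default TVM backends.
+_register_external_op_helper("nn.max_pool2d", supported=False)
+</code></pre></div></div>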
+
+<h3 id="rules-for-graph-patterns">Rules for graph patterns</h3>
+<p>Your accelerator or compiler may have optimized certain patterns (e.g., Conv2D + add + ReLU) into a single instruction or API call. In this case, you can specify a mapping from a graph pattern to your instruction/API. In the case of DNNL, its Conv2D API already includes bias addition, and it allows the next ReLU to be attached, so we can call DNNL as in the following code snippet (the complete implementation can be found <a href="https://github.com/apache/incubator-tvm/blob/master/src/r [...]
+
+<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DNNLConv2d</span><span class="p">(</span><span class="k">const</span> <span class="n">bool</span> <span class="n">has_bias</span> <span class="o">=</span> <span class="nb">false</span><span class="p">,</span> <span class="k">const</span> <span class="n">bool</span> <span class="n">has_relu</span> <span class="o">=</span> <span class="nb">false</span><span class="p">)</span> <span [...]
+  <span class="c1">// ... skip ...</span>
+  <span class="k">auto</span> <span class="n">conv_desc</span> <span class="o">=</span> <span class="n">dnnl</span><span class="o">::</span><span class="n">convolution_forward</span><span class="o">::</span><span class="n">desc</span><span class="p">(</span>
+    <span class="n">dnnl</span><span class="o">::</span><span class="n">prop_kind</span><span class="o">::</span><span class="n">forward_inference</span><span class="p">,</span>
+    <span class="n">dnnl</span><span class="o">::</span><span class="n">algorithm</span><span class="o">::</span><span class="n">convolution_direct</span><span class="p">,</span>
+    <span class="n">conv_src_md</span><span class="p">,</span> <span class="n">conv_weights_md</span><span class="p">,</span> <span class="n">conv_bias_md</span><span class="p">,</span> <span class="n">conv_dst_md</span><span class="p">,</span>
+    <span class="n">strides_dims</span><span class="p">,</span> <span class="n">padding_dims_l</span><span class="p">,</span> <span class="n">padding_dims_r</span><span class="p">);</span>
+
+  <span class="c1">// Attach ReLU</span>
+  <span class="n">dnnl</span><span class="o">::</span><span class="n">primitive_attr</span> <span class="n">attr</span><span class="p">;</span>
+  <span class="k">if</span> <span class="p">(</span><span class="n">has_relu</span><span class="p">)</span> <span class="p">{</span>
+    <span class="n">dnnl</span><span class="o">::</span><span class="n">post_ops</span> <span class="n">ops</span><span class="p">;</span>
+    <span class="n">ops</span><span class="p">.</span><span class="n">append_eltwise</span><span class="p">(</span><span class="mi">1</span><span class="p">.</span><span class="n">f</span><span class="p">,</span> <span class="n">dnnl</span><span class="o">::</span><span class="n">algorithm</span><span class="o">::</span><span class="n">eltwise_relu</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="n">f</span><span class="p">,</span> <span class= [...]
+    <span class="n">attr</span><span class="p">.</span><span class="n">set_post_ops</span><span class="p">(</span><span class="n">ops</span><span class="p">);</span>
+  <span class="p">}</span>
+
+  <span class="k">auto</span> <span class="n">conv2d_prim_desc</span> <span class="o">=</span> <span class="n">dnnl</span><span class="o">::</span><span class="n">convolution_forward</span><span class="o">::</span><span class="n">primitive_desc</span><span class="p">(</span>
+    <span class="n">conv_desc</span><span class="p">,</span> <span class="n">attr</span><span class="p">,</span> <span class="n">engine_</span><span class="p">);</span>
+  <span class="c1">// ... skip ...</span>
+</code></pre></div></div>
+<p>In this case, besides mapping a single <code class="highlighter-rouge">conv2d</code>, we would like to map the graph pattern <code class="highlighter-rouge">conv2d+relu</code> to <code class="highlighter-rouge">DNNLConv2d(false, true)</code>, and map <code class="highlighter-rouge">conv2d+add+relu</code> to <code class="highlighter-rouge">DNNLConv2d(true, true)</code>. We can achieve this with the following code snippet:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">make_pattern</span><span class="p">(</span><span class="n">with_bias</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
+  <span class="n">data</span> <span class="o">=</span> <span class="n">wildcard</span><span class="p">()</span>
+  <span class="n">weight</span> <span class="o">=</span> <span class="n">wildcard</span><span class="p">()</span>
+  <span class="n">bias</span> <span class="o">=</span> <span class="n">wildcard</span><span class="p">()</span>
+  <span class="n">conv</span> <span class="o">=</span> <span class="n">is_op</span><span class="p">(</span><span class="s">'nn.conv2d'</span><span class="p">)(</span><span class="n">data</span><span class="p">,</span> <span class="n">weight</span><span class="p">)</span>
+  <span class="k">if</span> <span class="n">with_bias</span><span class="p">:</span>
+    <span class="n">conv_out</span> <span class="o">=</span> <span class="n">is_op</span><span class="p">(</span><span class="s">'add'</span><span class="p">)(</span><span class="n">conv</span><span class="p">,</span> <span class="n">bias</span><span class="p">)</span>
+  <span class="k">else</span><span class="p">:</span>
+    <span class="n">conv_out</span> <span class="o">=</span> <span class="n">conv</span>
+  <span class="k">return</span> <span class="n">is_op</span><span class="p">(</span><span class="s">'nn.relu'</span><span class="p">)(</span><span class="n">conv_out</span><span class="p">)</span>
+
+<span class="o">@</span><span class="n">register_pattern_table</span><span class="p">(</span><span class="s">"dnnl"</span><span class="p">)</span>
+<span class="k">def</span> <span class="nf">pattern_table</span><span class="p">():</span>
+  <span class="n">conv2d_bias_relu_pat</span> <span class="o">=</span> <span class="p">(</span><span class="s">"dnnl.conv2d_bias_relu"</span><span class="p">,</span> <span class="n">make_pattern</span><span class="p">(</span><span class="n">with_bias</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
+  <span class="n">conv2d_relu_pat</span> <span class="o">=</span> <span class="p">(</span><span class="s">"dnnl.conv2d_relu"</span><span class="p">,</span> <span class="n">make_pattern</span><span class="p">(</span><span class="n">with_bias</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span>
+  <span class="n">dnnl_patterns</span> <span class="o">=</span> <span class="p">[</span><span class="n">conv2d_bias_relu_pat</span><span class="p">,</span> <span class="n">conv2d_relu_pat</span><span class="p">]</span>
+  <span class="k">return</span> <span class="n">dnnl_patterns</span>
+</code></pre></div></div>
+
+<p>In the DNNL example, we implemented two patterns with different names so that we can easily recognize them in the codegen. Note that the patterns are implemented in the Relay pattern language. You can follow <a href="https://tvm.apache.org/docs/langref/relay_pattern.html">this tutorial</a> to learn how to write your own patterns.</p>
+
+<p>With the pattern table, we can then use a Relay pass to perform the transformation from</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%1 = nn.conv2d(%data, %weight, ...)
+%2 = add(%1, %bias)
+%3 = nn.relu(%2)
+</code></pre></div></div>
+<p>to</p>
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%1 = fn(%input1, %input2, %input3,
+        Composite="dnnl.conv2d_bias_relu",
+        PartitionedFromPattern="nn.conv2d_add_nn.relu_") {
+  %1 = nn.conv2d(%input1, %input2, ...)
+  %2 = add(%1, %input3)
+  nn.relu(%2)
+}
+%2 = %1(%data, %weight, %bias)
+</code></pre></div></div>
+<p>Thus, the DNNL codegen can get the pattern name <code class="highlighter-rouge">conv2d_bias_relu</code> and map <code class="highlighter-rouge">%1</code> to <code class="highlighter-rouge">DNNLConv2d(true, true)</code>.</p>
+
+<p>As you may have noticed, we also have an attribute called “PartitionedFromPattern” in the composite function. This can be helpful if your pattern contains <code class="highlighter-rouge">wildcard</code> operators. For example, we may have a pattern table entry <code class="highlighter-rouge">("conv2d_with_something", conv2d -&gt; *)</code>:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">make_pattern</span><span class="p">(</span><span class="n">with_bias</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
+  <span class="n">data</span> <span class="o">=</span> <span class="n">wildcard</span><span class="p">()</span>
+  <span class="n">weight</span> <span class="o">=</span> <span class="n">wildcard</span><span class="p">()</span>
+  <span class="n">conv</span> <span class="o">=</span> <span class="n">is_op</span><span class="p">(</span><span class="s">'nn.conv2d'</span><span class="p">)(</span><span class="n">data</span><span class="p">,</span> <span class="n">weight</span><span class="p">)</span>
+  <span class="k">return</span> <span class="n">wildcard</span><span class="p">()(</span><span class="n">conv</span><span class="p">)</span>
+</code></pre></div></div>
+<p>In this case, you will get a composite function with <code class="highlighter-rouge">Composite=conv2d_with_something</code>, but you have no idea which graph it actually matched. That’s where PartitionedFromPattern comes into play. You can tell whether the matched graph is <code class="highlighter-rouge">conv2d -&gt; add</code> or <code class="highlighter-rouge">conv2d -&gt; relu</code> by looking at <code class="highlighter-rouge">PartitionedFromPattern</code> to see if it is <c [...]
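+
+<p>For example, a codegen could dispatch on this attribute with a small helper like the following (a minimal sketch; the helper name and return labels are assumed):</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def classify_composite(func):
+  # "PartitionedFromPattern" records the operators the pattern actually matched.
+  matched = str(func.attrs["PartitionedFromPattern"])
+  if matched == "nn.conv2d_add_":
+    return "conv2d_with_bias"
+  if matched == "nn.conv2d_nn.relu_":
+    return "conv2d_with_relu"
+  return "unknown"
+</code></pre></div></div>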
+
+<h2 id="bring-dnnl-to-tvm-relay-graph-transformation">Bring DNNL to TVM: Relay Graph Transformation</h2>
+<p>With the annotation rules from the previous step, we can now apply a list of BYOC Relay passes to transform the Relay graph from Figure 1 to Figure 4:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mod</span> <span class="o">=</span> <span class="n">create_relay_module_from_model</span><span class="p">()</span> <span class="c1"># Output: Figure 1
+</span><span class="n">mod</span> <span class="o">=</span> <span class="n">transform</span><span class="o">.</span><span class="n">MergeComposite</span><span class="p">(</span><span class="n">pattern_table</span><span class="p">)(</span><span class="n">mod</span><span class="p">)</span>
+<span class="n">mod</span> <span class="o">=</span> <span class="n">transform</span><span class="o">.</span><span class="n">AnnotateTarget</span><span class="p">([</span><span class="s">"dnnl"</span><span class="p">])(</span><span class="n">mod</span><span class="p">)</span> <span class="c1"># Output: Figure 2
+</span><span class="n">mod</span> <span class="o">=</span> <span class="n">transform</span><span class="o">.</span><span class="n">MergeCompilerRegions</span><span class="p">()(</span><span class="n">mod</span><span class="p">)</span> <span class="c1"># Output: Figure 3
+</span><span class="n">mod</span> <span class="o">=</span> <span class="n">transform</span><span class="o">.</span><span class="n">PartitionGraph</span><span class="p">()(</span><span class="n">mod</span><span class="p">)</span> <span class="c1"># Output: Figure 4
+</span></code></pre></div></div>
+<p>As can be seen, each Relay pass can be mapped to a step we have introduced in <a href="#how-byoc-works">How BYOC Works</a>.</p>
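+
+<p>Equivalently, the four passes can be composed into a single pipeline with <code class="highlighter-rouge">tvm.transform.Sequential</code> (a sketch, assuming the same <code class="highlighter-rouge">pattern_table</code> as above):</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import tvm
+from tvm.relay import transform
+
+seq = tvm.transform.Sequential([
+    transform.MergeComposite(pattern_table),
+    transform.AnnotateTarget(["dnnl"]),
+    transform.MergeCompilerRegions(),
+    transform.PartitionGraph(),
+])
+mod = seq(mod)
+</code></pre></div></div>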
+
+<h2 id="bring-dnnl-to-tvm-json-codegenruntime">Bring DNNL to TVM: JSON Codegen/Runtime</h2>
+<p>Now let’s implement the DNNL codegen that serializes a Relay graph to a JSON representation, and then implement the DNNL JSON runtime to deserialize and execute the graph. <em>Note that if you attempt to implement a codegen to generate C-compatible programs, you may want to directly proceed to the next section.</em></p>
+
+<p>To enable the DNNL JSON codegen/runtime in TVM for this example, please make sure DNNL is available on your machine, and build TVM with <code class="highlighter-rouge">set(USE_DNNL_CODEGEN ON)</code> in <code class="highlighter-rouge">config.cmake</code>.</p>
+
+<p>The DNNL codegen is implemented in <a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc"><code class="highlighter-rouge">src/relay/backend/contrib/dnnl/codegen.cc</code></a>. Since we implemented the DNNL codegen in both forms in this file for illustration purposes, you can focus on the part covered by the <code class="highlighter-rouge">USE_JSON_RUNTIME</code> macro when tracing the code.</p>
+
+<p>We first register the codegen with the TVM registration API (<a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L510">L510</a>). This registration makes the TVM compile engine dispatch Relay functions with <code class="highlighter-rouge">Compiler=&lt;your codegen&gt;</code> to <code class="highlighter-rouge">relay.ext.&lt;your codegen&gt;</code>. Then we implement the entry function of the DNNL compiler [...]
+
+<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">runtime</span><span class="o">::</span><span class="n">Module</span> <span class="nf">DNNLCompiler</span><span class="p">(</span><span class="k">const</span> <span class="n">ObjectRef</span><span class="o">&amp;</span> <span class="n">ref</span><span class="p">)</span> <span class="p">{</span>
+  <span class="c1">// "ref" should be the paritioned Relay function with kCompiler=dnnl.</span>
+  <span class="n">CHECK</span><span class="p">(</span><span class="n">ref</span><span class="o">-&gt;</span><span class="n">IsInstance</span><span class="o">&lt;</span><span class="n">FunctionNode</span><span class="o">&gt;</span><span class="p">());</span>
+  <span class="k">auto</span> <span class="n">func</span> <span class="o">=</span> <span class="n">Downcast</span><span class="o">&lt;</span><span class="n">Function</span><span class="o">&gt;</span><span class="p">(</span><span class="n">ref</span><span class="p">);</span>
+
+  <span class="c1">// Get the function name as the symbol to match in runtime.</span>
+  <span class="k">auto</span> <span class="n">func_name</span> <span class="o">=</span> <span class="n">GetExtSymbol</span><span class="p">(</span><span class="n">func</span><span class="p">);</span>
+
+  <span class="c1">// Serialize the function to a JSON string (introduce later).</span>
+  <span class="n">DNNLJSONSerializer</span> <span class="n">serializer</span><span class="p">(</span><span class="n">func_name</span><span class="p">,</span> <span class="n">func</span><span class="p">);</span>
+  <span class="n">serializer</span><span class="p">.</span><span class="n">serialize</span><span class="p">();</span>
+  <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">graph_json</span> <span class="o">=</span> <span class="n">serializer</span><span class="p">.</span><span class="n">GetJSON</span><span class="p">();</span>
+
+  <span class="c1">// The constant tensor names that have been bound to the module.</span>
+  <span class="c1">// All constant tensors will be serialzied along with the JSON graph</span>
+  <span class="c1">// when export_library is invoked.</span>
+  <span class="k">auto</span> <span class="n">params</span> <span class="o">=</span> <span class="n">serializer</span><span class="p">.</span><span class="n">GetParams</span><span class="p">();</span>
+
+  <span class="c1">// The function to create DNNL JSON runtime (introduce later).</span>
+  <span class="k">const</span> <span class="k">auto</span><span class="o">*</span> <span class="n">pf</span> <span class="o">=</span> <span class="n">runtime</span><span class="o">::</span><span class="n">Registry</span><span class="o">::</span><span class="n">Get</span><span class="p">(</span><span class="s">"runtime.DNNLJSONRuntimeCreate"</span><span class="p">);</span>
+  <span class="n">CHECK</span><span class="p">(</span><span class="n">pf</span> <span class="o">!=</span> <span class="n">nullptr</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="s">"Cannot find JSON runtime module to create"</span><span class="p">;</span>
+
+  <span class="c1">// Create a DNNL runtime module that can run the serialized function.</span>
+  <span class="k">auto</span> <span class="n">mod</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">pf</span><span class="p">)(</span><span class="n">func_name</span><span class="p">,</span> <span class="n">graph_json</span><span class="p">,</span> <span class="n">params</span><span class="p">);</span>
+  <span class="k">return</span> <span class="n">mod</span><span class="p">;</span>
+<span class="p">}</span>
+<span class="n">TVM_REGISTER_GLOBAL</span><span class="p">(</span><span class="s">"relay.ext.dnnl"</span><span class="p">).</span><span class="n">set_body_typed</span><span class="p">(</span><span class="n">DNNLCompiler</span><span class="p">);</span>
+</code></pre></div></div>
+
+<p>Note that <strong><em>each runtime module is only responsible for one Relay function, meaning that you may have several DNNL runtime modules in a single <code class="highlighter-rouge">.so</code> file.</em></strong></p>
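+
+<p>You can observe this from Python after loading the library (a hedged sketch; the <code class="highlighter-rouge">imported_modules</code> behavior may differ across TVM versions):</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import tvm
+
+lib = tvm.runtime.load_module("model_with_dnnl.so")
+for sub in lib.imported_modules:
+  # Expect one external runtime module per offloaded Relay function.
+  print(sub.type_key)
+</code></pre></div></div>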
+
+<h3 id="dnnl-json-serialization">DNNL JSON Serialization</h3>
+<p>Next, we implement the DNNL JSON serializer (<a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L429">L429</a>). We derived it from the BYOC JSON codegen (<a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/codegen_json/codegen_json.h">src/relay/backend/contrib/codegen_json/codegen_json.h</a>). The special process in the DNNL JSON serialize [...]
+
+<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
+  </span><span class="err">op:</span><span class="w"> </span><span class="s2">"kernel"</span><span class="p">,</span><span class="w">
+  </span><span class="err">name:</span><span class="w"> </span><span class="s2">"dnnl.conv2d_relu"</span><span class="p">,</span><span class="w">
+  </span><span class="err">inputs:</span><span class="w"> </span><span class="p">[[</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class= [...]
+  </span><span class="err">attrs:</span><span class="w"> </span><span class="p">{</span><span class="w">
+    </span><span class="err">PartitionedFromPattern:</span><span class="w"> </span><span class="p">[</span><span class="s2">"nn.conv2d_nn.relu_"</span><span class="p">],</span><span class="w">
+    </span><span class="err">shape:</span><span class="w"> </span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">32</span><span class="p">,</span><span class="w"> </span><span class="mi">14</span><span class="p">,</span><span class="w"> </span><span class="mi">14</span><span class="p">]</span><span class="w">
+  </span><span class="p">}</span><span class="w">
+</span><span class="p">}</span><span class="w">
+</span></code></pre></div></div>
+<p>The problem is that we still need the Conv2D attributes, such as padding and strides, at runtime, but the BYOC JSON serializer only attaches the attributes of the composite function rather than those of the body operators. On the other hand, the customized DNNL JSON serializer attaches the attributes of the first and only Conv2D in the composite function to generate the following JSON node:</p>
+
+<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
+  </span><span class="err">op:</span><span class="w"> </span><span class="s2">"kernel"</span><span class="p">,</span><span class="w">
+  </span><span class="err">name:</span><span class="w"> </span><span class="s2">"dnnl.conv2d_relu"</span><span class="p">,</span><span class="w">
+  </span><span class="err">inputs:</span><span class="w"> </span><span class="p">[[</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class= [...]
+  </span><span class="err">attrs:</span><span class="w"> </span><span class="p">{</span><span class="w">
+    </span><span class="err">shape:</span><span class="w"> </span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">32</span><span class="p">,</span><span class="w"> </span><span class="mi">14</span><span class="p">,</span><span class="w"> </span><span class="mi">14</span><span class="p">],</span><span class="w">
+    </span><span class="err">data_layout:</span><span class="w"> </span><span class="p">[</span><span class="s2">"NCHW"</span><span class="p">],</span><span class="w">
+    </span><span class="err">kernel_layout:</span><span class="w"> </span><span class="p">[</span><span class="s2">"OIHW"</span><span class="p">],</span><span class="w">
+    </span><span class="err">strides:</span><span class="w"> </span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">],</span><span class="w">
+    </span><span class="err">padding:</span><span class="w"> </span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">]</span><span class="w">
+  </span><span class="p">}</span><span class="w">
+</span><span class="p">}</span><span class="w">
+</span></code></pre></div></div>
+
+<p>As can be seen from the DNNL JSON serializer, you can customize the serializer to generate any form of JSON you like, as long as your JSON runtime can interpret it.</p>
+
+<h3 id="dnnl-json-runtime">DNNL JSON Runtime</h3>
+
+<p>We then implement a DNNL JSON runtime to interpret and execute the serialized JSON graph. We put it under <a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl_json_runtime.cc"><code class="highlighter-rouge">src/runtime/contrib/dnnl/dnnl_json_runtime.cc</code></a>.</p>
+
+<p>Again, we first register two APIs to create the runtime so that we can use them anywhere. <code class="highlighter-rouge">runtime.DNNLJSONRuntimeCreate</code> is used in the previous part after serialization, and <code class="highlighter-rouge">runtime.module.loadbinary_dnnl_json</code> is used when loading the <code class="highlighter-rouge">.so</code> file back.</p>
+
+<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Create a DNNL JSON runtime to interpret and execute the given JSON graph.</span>
+<span class="n">runtime</span><span class="o">::</span><span class="n">Module</span> <span class="nf">DNNLJSONRuntimeCreate</span><span class="p">(</span><span class="n">String</span> <span class="n">symbol_name</span><span class="p">,</span> <span class="n">String</span> <span class="n">graph_json</span><span class="p">,</span>
+                                      <span class="k">const</span> <span class="n">Array</span><span class="o">&lt;</span><span class="n">String</span><span class="o">&gt;&amp;</span> <span class="n">const_names</span><span class="p">)</span> <span class="p">{</span>
+  <span class="k">auto</span> <span class="n">n</span> <span class="o">=</span> <span class="n">make_object</span><span class="o">&lt;</span><span class="n">DNNLJSONRuntime</span><span class="o">&gt;</span><span class="p">(</span><span class="n">symbol_name</span><span class="p">,</span> <span class="n">graph_json</span><span class="p">,</span> <span class="n">const_names</span><span class="p">);</span>
+  <span class="k">return</span> <span class="n">runtime</span><span class="o">::</span><span class="n">Module</span><span class="p">(</span><span class="n">n</span><span class="p">);</span>
+<span class="p">}</span>
+<span class="n">TVM_REGISTER_GLOBAL</span><span class="p">(</span><span class="s">"runtime.DNNLJSONRuntimeCreate"</span><span class="p">)</span>
+    <span class="p">.</span><span class="n">set_body_typed</span><span class="p">(</span><span class="n">DNNLJSONRuntimeCreate</span><span class="p">);</span>
+
+<span class="n">TVM_REGISTER_GLOBAL</span><span class="p">(</span><span class="s">"runtime.module.loadbinary_dnnl_json"</span><span class="p">)</span>
+    <span class="p">.</span><span class="n">set_body_typed</span><span class="p">(</span><span class="n">JSONRuntimeBase</span><span class="o">::</span><span class="n">LoadFromBinary</span><span class="o">&lt;</span><span class="n">DNNLJSONRuntime</span><span class="o">&gt;</span><span class="p">);</span>
+</code></pre></div></div>
+
+<p>Now we explain the DNNL JSON runtime implementation. The basic class structure is:</p>
+
+<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">class</span> <span class="n">DNNLJSONRuntime</span> <span class="o">:</span> <span class="n">public</span> <span class="n">JSONRuntimeBase</span> <span class="p">{</span>
+  <span class="k">const</span>  <span class="kt">char</span><span class="o">*</span> <span class="n">type_key</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span> <span class="k">return</span>  <span class="s">"dnnl_json"</span><span class="p">;</span> <span class="p">}</span> 
+  <span class="kt">void</span> <span class="n">Init</span><span class="p">(</span><span class="k">const</span> <span class="n">Array</span><span class="o">&lt;</span><span class="n">NDArray</span><span class="o">&gt;&amp;</span> <span class="n">consts</span><span class="p">)</span> <span class="n">override</span> <span class="p">{</span>
+    <span class="c1">// Initialize the DNNL graph engine.</span>
+    <span class="n">BuildEngine</span><span class="p">();</span>
+    
+    <span class="c1">// Setup constants entries for weights.</span>
+    <span class="n">CHECK_EQ</span><span class="p">(</span><span class="n">consts</span><span class="p">.</span><span class="n">size</span><span class="p">(),</span> <span class="n">const_idx_</span><span class="p">.</span><span class="n">size</span><span class="p">())</span>
+      <span class="o">&lt;&lt;</span> <span class="s">"The number of input constants must match the number of required."</span><span class="p">;</span>
+    <span class="n">SetupConstants</span><span class="p">(</span><span class="n">consts</span><span class="p">);</span>
+  <span class="p">}</span>
+
+  <span class="kt">void</span> <span class="n">Run</span><span class="p">()</span> <span class="n">override</span> <span class="p">{</span>
+   <span class="c1">// 1. Fill in the input buffers.</span>
+   <span class="c1">// 2. Invoke the engine through intepreting the stream.</span>
+   <span class="c1">// 3. Read and fill output buffers.</span>
+  <span class="p">}</span>
+<span class="p">}</span>
+</code></pre></div></div>
+
+<p>The <code class="highlighter-rouge">Init</code> function is in charge of building the DNNL engine by interpreting the JSON graph string (see <a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl_json_runtime.cc#L93">L93</a> for <code class="highlighter-rouge">BuildEngine</code>), and filling the constant weights to the corresponding data entry buffers (the <code class="highlighter-rouge">SetupConstant</code> is imp [...]
+
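+<p>For reference, here is a minimal sketch of what the constant setup can look like. The member names <code class="highlighter-rouge">data_entry_</code> and <code class="highlighter-rouge">const_idx_</code> are our assumptions about the <code class="highlighter-rouge">JSONRuntimeBase</code> internals; see the linked source for the exact implementation:</p>
+
+<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// A sketch, not the exact TVM code: copy each constant weight into the
+// data entry buffer that the JSON graph assigned to it.
+void SetupConstants(const Array&lt;NDArray&gt;&amp; consts) {
+  for (size_t i = 0; i &lt; consts.size(); ++i) {
+    // const_idx_[i] is the node index of the i-th constant; EntryID
+    // maps it to its data entry.
+    data_entry_[EntryID(const_idx_[i], 0)].CopyFrom(consts[i]);
+  }
+}
+</code></pre></div></div>
+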
+<p>Next, the <code class="highlighter-rouge">Run</code> function (<a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl_json_runtime.cc#L64">L64</a>) first writes the input tensors, which may come from user inputs or constant weights, to the corresponding DNNL memory buffers we initialized when building the DNNL engine. Then launch the DNNL engine to execute the JSON graph. Finally, it writes the DNNL output memory bu [...]
+
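+<p>To make the three steps concrete, below is a simplified sketch of how such a <code class="highlighter-rouge">Run</code> function could look. The member names (<code class="highlighter-rouge">net_</code>, <code class="highlighter-rouge">net_args_</code>, <code class="highlighter-rouge">stream_</code>) are our assumptions for illustration; refer to the linked source for the actual code:</p>
+
+<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// A hedged sketch of the three Run() steps, with assumed member names.
+void Run() override {
+  // 1. Fill in the input buffers: write each input/constant data entry
+  //    to the DNNL memory object bound to it at BuildEngine time.
+  for (size_t i = 0; i &lt; input_nodes_.size(); ++i) {
+    // ...copy data_entry_[...] into the corresponding DNNL memory...
+  }
+
+  // 2. Invoke the engine by executing the recorded primitive stream.
+  for (size_t i = 0; i &lt; net_.size(); ++i) {
+    net_.at(i).execute(stream_, net_args_.at(i));
+  }
+  stream_.wait();
+
+  // 3. Read the DNNL output memory buffers back to the output entries.
+}
+</code></pre></div></div>
+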
+<p>Since the rest of the DNNL JSON runtime implementation is too DNNL-specific to cover in detail in this post, we will stop here. We would like to emphasize that while the DNNL JSON runtime is a good reference to start with, your JSON runtime can be fully customized to fit your requirements.</p>
+
+<h2 id="bring-dnnl-to-tvm-c-source-codegen">Bring DNNL to TVM: C Source Codegen</h2>
+<p>Now let’s implement the DNNL codegen that generates C source code invoking DNNL APIs to execute the Relay graph. <em>Note that if you attempt to implement a codegen that generates another graph representation, such as JSON, you may want to read <a href="#bring-dnnl-to-tvm-json-codegenruntime">Bring DNNL to TVM: JSON Codegen/Runtime</a> and skip this section.</em></p>
+
+<p>To enable DNNL C source codegen in TVM to work on this example, please make sure DNNL is available on your machine, and build TVM with <code class="highlighter-rouge">set(USE_DNNL_CODEGEN C_SRC)</code> in <code class="highlighter-rouge">config.cmake</code>.</p>
+
+<p>The DNNL codegen is implemented in <a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc"><code class="highlighter-rouge">src/relay/backend/contrib/dnnl/codegen.cc</code></a>. Since we implemented DNNL codegen in both forms in this file for illustration purposes, you can focus on the part <strong>NOT</strong> covered by the <code class="highlighter-rouge">USE_JSON_RUNTIME</code> macro when tracing the code.</p>
+
+<p>We first register the codegen with the TVM registration API (<a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L510">L510</a>). This registration makes the TVM compile engine dispatch the Relay function with <code class="highlighter-rouge">Compiler=&lt;your codegen&gt;</code> to <code class="highlighter-rouge">relay.ext.&lt;your codegen&gt;</code>. Then we implement the entry function of the DNNL compiler  [...]
+
+<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">runtime</span><span class="o">::</span><span class="n">Module</span> <span class="nf">DNNLCompiler</span><span class="p">(</span><span class="k">const</span> <span class="n">ObjectRef</span><span class="o">&amp;</span> <span class="n">ref</span><span class="p">)</span> <span class="p">{</span>
+  <span class="n">DNNLModuleCodegen</span> <span class="n">dnnl</span><span class="p">;</span>
+  <span class="k">return</span> <span class="n">dnnl</span><span class="p">.</span><span class="n">CreateCSourceModule</span><span class="p">(</span><span class="n">ref</span><span class="p">);</span>
+<span class="p">}</span>
+<span class="n">TVM_REGISTER_GLOBAL</span><span class="p">(</span><span class="s">"relay.ext.dnnl"</span><span class="p">).</span><span class="n">set_body_typed</span><span class="p">(</span><span class="n">DNNLCompiler</span><span class="p">);</span>
+</code></pre></div></div>
+
+<p>Note that <strong><em>each runtime module is only responsible for one Relay function, meaning that you may have several DNNL runtime modules in a single <code class="highlighter-rouge">.so</code> file.</em></strong></p>
+
+<p>Then, we derive <code class="highlighter-rouge">CSourceModuleCodegenBase</code> to implement  <code class="highlighter-rouge">DNNLModuleCodegen</code> in <a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L362">L362</a>. While <code class="highlighter-rouge">CSourceModuleCodegenBase</code> is in charge of other module level processes such as serialization, we only need to implement the DNNL code gene [...]
+
+<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">runtime</span><span class="o">::</span><span class="n">Module</span> <span class="n">CreateCSourceModule</span><span class="p">(</span><span class="k">const</span> <span class="n">ObjectRef</span><span class="o">&amp;</span> <span class="n">ref</span><span class="p">)</span> <span class="n">override</span> <span class="p">{</span>
+    <span class="c1">// Include headers</span>
+    <span class="c1">// ...skip...</span>
+    <span class="n">code_stream_</span> <span class="o">&lt;&lt;</span> <span class="s">"#include &lt;dnnl/dnnl_kernel.h&gt;</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
+    <span class="c1">// ...skip...</span>
+
+    <span class="c1">// "ref" should be the paritioned Relay function with kCompiler=dnnl.</span>
+    <span class="n">CHECK</span><span class="p">(</span><span class="n">ref</span><span class="o">-&gt;</span><span class="n">IsInstance</span><span class="o">&lt;</span><span class="n">FunctionNode</span><span class="o">&gt;</span><span class="p">());</span>
+    <span class="k">auto</span> <span class="n">res</span> <span class="o">=</span> <span class="n">GenDNNLFunc</span><span class="p">(</span><span class="n">Downcast</span><span class="o">&lt;</span><span class="n">Function</span><span class="o">&gt;</span><span class="p">(</span><span class="n">ref</span><span class="p">));</span>
+
+    <span class="c1">// "code" is the generated C code with DNNL APIs.</span>
+    <span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">code</span> <span class="o">=</span> <span class="n">code_stream_</span><span class="p">.</span><span class="n">str</span><span class="p">();</span>
+
+    <span class="c1">// "res" is a tuple of constant weights (symbols, values).</span>
+    <span class="c1">// All constant tensors will be serialzied along with the generated C code</span>
+    <span class="c1">// when export_library is invoked.</span>
+    <span class="n">String</span> <span class="n">sym</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">get</span><span class="o">&lt;</span><span class="mi">0</span><span class="o">&gt;</span><span class="p">(</span><span class="n">res</span><span class="p">);</span>
+    <span class="n">Array</span><span class="o">&lt;</span><span class="n">String</span><span class="o">&gt;</span> <span class="n">variables</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">get</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span><span class="p">(</span><span class="n">res</span><span class="p">);</span>
+
+    <span class="c1">// Create a CSource module with all above artifacts.</span>
+    <span class="k">const</span> <span class="k">auto</span><span class="o">*</span> <span class="n">pf</span> <span class="o">=</span> <span class="n">runtime</span><span class="o">::</span><span class="n">Registry</span><span class="o">::</span><span class="n">Get</span><span class="p">(</span><span class="s">"runtime.CSourceModuleCreate"</span><span class="p">);</span>
+    <span class="n">CHECK</span><span class="p">(</span><span class="n">pf</span> <span class="o">!=</span> <span class="n">nullptr</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="s">"Cannot find csource module to create the external runtime module"</span><span class="p">;</span>
+    <span class="k">return</span> <span class="p">(</span><span class="o">*</span><span class="n">pf</span><span class="p">)(</span><span class="n">code</span><span class="p">,</span> <span class="s">"c"</span><span class="p">,</span> <span class="n">sym</span><span class="p">,</span> <span class="n">variables</span><span class="p">);</span>
+  <span class="p">}</span>
+</code></pre></div></div>
+
+<p>Next, we implement <code class="highlighter-rouge">GenDNNLFunc</code> (<a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L365">L365</a>) to generate the compilable C code with DNNL APIs as follows. Please see the embedded comments for the explanations of TVM C source runtime module compatible function interfaces.</p>
+
+<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// The example Relay graph: conv2d -&gt; add -&gt; relu.</span>
+<span class="cp">#include &lt;cstdint&gt;
+#include &lt;cstdlib&gt;
+#include &lt;cstring&gt;
+#include &lt;vector&gt;
+#include &lt;tvm/runtime/c_runtime_api.h&gt;
+#include &lt;tvm/runtime/container.h&gt;
+#include &lt;tvm/runtime/packed_func.h&gt;
+#include &lt;dlpack/dlpack.h&gt;
+#include &lt;dnnl/dnnl_kernel.h&gt;
+</span><span class="n">using</span> <span class="n">namespace</span> <span class="n">tvm</span><span class="o">::</span><span class="n">runtime</span><span class="p">;</span>
+<span class="n">using</span> <span class="n">namespace</span> <span class="n">tvm</span><span class="o">::</span><span class="n">runtime</span><span class="o">::</span><span class="n">contrib</span><span class="p">;</span>
+
+<span class="c1">// Execute the conv2d-&gt;add-&gt;relu graph with DNNL.</span>
+<span class="k">extern</span> <span class="s">"C"</span> <span class="kt">void</span> <span class="nf">dnnl_0_</span><span class="p">(</span><span class="kt">float</span><span class="o">*</span> <span class="n">dnnl_0_i0</span><span class="p">,</span> <span class="kt">float</span><span class="o">*</span> <span class="n">dnnl_0_i1</span><span class="p">,</span>
+                        <span class="kt">float</span><span class="o">*</span> <span class="n">dnnl_0_i2</span><span class="p">,</span> <span class="kt">float</span><span class="o">*</span> <span class="n">out0</span><span class="p">)</span> <span class="p">{</span>
+  <span class="c1">// Allocate intermediate buffers.</span>
+  <span class="kt">float</span><span class="o">*</span> <span class="n">buf_0</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span><span class="o">*</span><span class="p">)</span><span class="n">std</span><span class="o">::</span><span class="n">malloc</span><span class="p">(</span><span class="mi">4</span> <span class="o">*</span> <span class="mi">4608</span><span class="p">);</span>
+  <span class="kt">float</span><span class="o">*</span> <span class="n">buf_1</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span><span class="o">*</span><span class="p">)</span><span class="n">std</span><span class="o">::</span><span class="n">malloc</span><span class="p">(</span><span class="mi">4</span> <span class="o">*</span> <span class="mi">4608</span><span class="p">);</span>
+  <span class="kt">float</span><span class="o">*</span> <span class="n">buf_2</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span><span class="o">*</span><span class="p">)</span><span class="n">std</span><span class="o">::</span><span class="n">malloc</span><span class="p">(</span><span class="mi">4</span> <span class="o">*</span> <span class="mi">4608</span><span class="p">);</span>
+
+  <span class="c1">// Pre-implemented op-based DNNL functions.</span>
+  <span class="n">dnnl_conv2d</span><span class="p">(</span><span class="n">dnnl_0_i0</span><span class="p">,</span> <span class="n">dnnl_0_i1</span><span class="p">,</span> <span class="n">buf_0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class [...]
+  <span class="n">dnnl_add</span><span class="p">(</span><span class="n">buf_0</span><span class="p">,</span> <span class="n">dnnl_0_i2</span><span class="p">,</span> <span class="n">buf_1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">12</span><span class="p">);</span>
+  <span class="n">dnnl_relu</span><span class="p">(</span><span class="n">buf_1</span><span class="p">,</span> <span class="n">buf_2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">12</span><span class="p">);</span>
+
+  <span class="c1">// Copy the final output to the corresponding buffer.</span>
+  <span class="n">std</span><span class="o">::</span><span class="n">memcpy</span><span class="p">(</span><span class="n">out0</span><span class="p">,</span> <span class="n">buf_2</span><span class="p">,</span> <span class="mi">4</span> <span class="o">*</span> <span class="mi">4608</span><span class="p">);</span>
+  <span class="n">std</span><span class="o">::</span><span class="n">free</span><span class="p">(</span><span class="n">buf_0</span><span class="p">);</span>
+  <span class="n">std</span><span class="o">::</span><span class="n">free</span><span class="p">(</span><span class="n">buf_1</span><span class="p">);</span>
+  <span class="n">std</span><span class="o">::</span><span class="n">free</span><span class="p">(</span><span class="n">buf_2</span><span class="p">);</span>
+<span class="p">}</span>
+
+<span class="c1">// The wrapper function with all arguments in DLTensor type.</span>
+<span class="k">extern</span> <span class="s">"C"</span> <span class="kt">int</span> <span class="nf">dnnl_0_wrapper_</span><span class="p">(</span><span class="n">DLTensor</span><span class="o">*</span> <span class="n">arg0</span><span class="p">,</span>
+        <span class="n">DLTensor</span><span class="o">*</span> <span class="n">arg1</span><span class="p">,</span>
+        <span class="n">DLTensor</span><span class="o">*</span> <span class="n">arg2</span><span class="p">,</span>
+        <span class="n">DLTensor</span><span class="o">*</span> <span class="n">out0</span><span class="p">)</span> <span class="p">{</span>
+
+  <span class="c1">// Cast all DLTensor to primitive type buffers and invoke the above</span>
+  <span class="c1">// execution function.</span>
+  <span class="n">dnnl_0_</span><span class="p">(</span><span class="n">static_cast</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">*&gt;</span><span class="p">(</span><span class="n">arg0</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">),</span>
+  <span class="n">static_cast</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">*&gt;</span><span class="p">(</span><span class="n">arg1</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">),</span>
+  <span class="n">static_cast</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">*&gt;</span><span class="p">(</span><span class="n">arg2</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">),</span>
+  <span class="n">static_cast</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">*&gt;</span><span class="p">(</span><span class="n">out0</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">));</span>
+  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
+<span class="p">}</span>
+
+<span class="c1">// The TVM macro to generate TVM runtime compatible function "dnnl_0"</span>
+<span class="c1">// from our generated "dnnl_0_wrapper_".</span>
+<span class="n">TVM_DLL_EXPORT_TYPED_FUNC</span><span class="p">(</span><span class="n">dnnl_0</span><span class="p">,</span> <span class="n">dnnl_0_wrapper_</span><span class="p">);</span>
+</code></pre></div></div>
+
+<p>Note that the pre-implemented op-based DNNL functions are in <a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl.cc">src/runtime/contrib/dnnl/dnnl.cc</a>.</p>
+
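+<p>As an example of what such an op-based function can look like, below is a hedged sketch of a <code class="highlighter-rouge">dnnl_relu</code> built with the DNNL primitive API, matching the signature visible in the generated code above. This is our illustrative reconstruction, not a verbatim copy of <code class="highlighter-rouge">dnnl.cc</code>:</p>
+
+<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;dnnl.hpp&gt;
+
+using namespace dnnl;
+
+// ReLU over an NCHW float buffer: build an eltwise primitive and run it.
+extern "C" void dnnl_relu(float* data, float* out, int n, int c, int h, int w) {
+  engine eng(engine::kind::cpu, 0);
+  stream s(eng);
+
+  memory::dims shape = {n, c, h, w};
+  auto md = memory::desc(shape, memory::data_type::f32, memory::format_tag::nchw);
+  auto src_mem = memory(md, eng, data);
+  auto dst_mem = memory(md, eng, out);
+
+  auto relu_desc = eltwise_forward::desc(prop_kind::forward_inference,
+                                         algorithm::eltwise_relu, md, 0.f);
+  auto relu_pd = eltwise_forward::primitive_desc(relu_desc, eng);
+
+  eltwise_forward(relu_pd).execute(s, {{DNNL_ARG_SRC, src_mem},
+                                       {DNNL_ARG_DST, dst_mem}});
+  s.wait();
+}
+</code></pre></div></div>
+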
+<p>Since the rest of the implementation in <a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc"><code class="highlighter-rouge">src/relay/backend/contrib/dnnl/codegen.cc</code></a> is too DNNL-specific to cover in detail in this post, we will stop here. The main idea is implementing a Relay graph visitor (<a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/r [...]
+
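+<p>Conceptually, the visitor can be sketched as follows. The class and member names here are illustrative rather than the exact ones in <code class="highlighter-rouge">codegen.cc</code>: it traverses the Relay function and emits one DNNL call per operator or composite function it visits.</p>
+
+<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;tvm/relay/expr_functor.h&gt;
+#include &lt;sstream&gt;
+
+// Illustrative skeleton of a code-generating Relay visitor.
+class CodegenDNNL : public tvm::relay::ExprVisitor {
+ public:
+  void VisitExpr_(const tvm::relay::CallNode* call) final {
+    // Visit the arguments first so their buffers are declared before use.
+    for (auto arg : call-&gt;args) {
+      VisitExpr(arg);
+    }
+    // Emit a DNNL API call (e.g., dnnl_conv2d(...)) for this node,
+    // mapping the operator attributes to the call arguments.
+    code_stream_ &lt;&lt; "/* emit one DNNL call here */\n";
+  }
+
+ private:
+  std::ostringstream code_stream_;
+};
+</code></pre></div></div>
+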
+<h3 id="c-source-compilation">C Source Compilation</h3>
+<p>As you may have noticed, the output of <code class="highlighter-rouge">DNNLCompiler</code> is a module with the generated C code in text format, which has not yet been compiled by <code class="highlighter-rouge">gcc</code> into an executable binary. In fact, the generated C code is compiled when users call <code class="highlighter-rouge">export_library(mod)</code>, as in the following code snippet:</p>
+
+<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">update_lib</span><span class="p">(</span><span class="n">lib</span><span class="p">):</span>
+    <span class="c1"># Include the path of src/runtime/contrib/dnnl/dnnl.cc
+</span>    <span class="n">test_dir</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">realpath</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n" [...]
+    <span class="n">source_dir</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">test_dir</span><span class="p">,</span> <span class="s">".."</span><span class="p">,</span> <span class="s">".."</span><span class="p">,</span> <span class="s">".."</span><span class="p">)</span>
+    <span class="n">contrib_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">source_dir</span><span class="p">,</span> <span class="s">"src"</span><span class="p">,</span> <span class="s">"runtime"</span><span class="p">,</span> <span class="s">"contrib"</span><span class="p">)</span>
+
+    <span class="c1"># Setup the gcc flag to compile DNNL code.
+</span>    <span class="n">kwargs</span> <span class="o">=</span> <span class="p">{}</span>
+    <span class="n">kwargs</span><span class="p">[</span><span class="s">"options"</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="s">"-O2"</span><span class="p">,</span> <span class="s">"-std=c++14"</span><span class="p">,</span> <span class="s">"-I"</span> <span class="o">+</span> <span class="n">contrib_path</span><span class="p">]</span>
+    <span class="n">tmp_path</span> <span class="o">=</span> <span class="n">util</span><span class="o">.</span><span class="n">tempdir</span><span class="p">()</span>
+    <span class="n">lib_name</span> <span class="o">=</span> <span class="s">'lib.so'</span>
+    <span class="n">lib_path</span> <span class="o">=</span> <span class="n">tmp_path</span><span class="o">.</span><span class="n">relpath</span><span class="p">(</span><span class="n">lib_name</span><span class="p">)</span>
+
+    <span class="c1"># The generated C code with DNNL APIs is compiled to a binary lib.so.
+</span>    <span class="n">lib</span><span class="o">.</span><span class="n">export_library</span><span class="p">(</span><span class="n">lib_path</span><span class="p">,</span> <span class="n">fcompile</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
+
+    <span class="c1"># Load the lib.so back to a runtime module.
+</span>    <span class="n">lib</span> <span class="o">=</span> <span class="n">runtime</span><span class="o">.</span><span class="n">load_module</span><span class="p">(</span><span class="n">lib_path</span><span class="p">)</span>
+    <span class="k">return</span> <span class="n">lib</span>
+
+<span class="k">with</span> <span class="n">tvm</span><span class="o">.</span><span class="n">transform</span><span class="o">.</span><span class="n">PassContext</span><span class="p">(</span><span class="n">opt_level</span><span class="o">=</span><span class="mi">3</span><span class="p">):</span>
+    <span class="n">json</span><span class="p">,</span> <span class="n">lib</span><span class="p">,</span> <span class="n">param</span> <span class="o">=</span> <span class="n">relay</span><span class="o">.</span><span class="n">build</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span> <span class="n">target</span><span class="o">=</span><span class="n">target</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="n"> [...]
+<span class="n">lib</span> <span class="o">=</span> <span class="n">update_lib</span><span class="p">(</span><span class="n">lib</span><span class="p">)</span>
+<span class="n">rt_mod</span> <span class="o">=</span> <span class="n">tvm</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">graph_runtime</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">json</span><span class="p">,</span> <span class="n">lib</span><span class="p">,</span> <span class="n">ctx</span><span class="p">)</span>
+</code></pre></div></div>
+
+<h2 id="bring-dnnl-to-tvm-build-tvm-with-dnnl-codegenruntime">Bring DNNL to TVM: Build TVM with DNNL Codegen/Runtime</h2>
+<p>Finally, we create <a href="https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/cmake/modules/contrib/DNNL.cmake">cmake/modules/contrib/DNNL.cmake</a> to include the DNNL codegen when building TVM. For demonstration purposes, our DNNL codegen has two implementations in the same cmake file. You can focus on just one of them based on your needs.</p>
+
+<p>With the cmake file ready, now users can specify <code class="highlighter-rouge">set(USE_DNNL_CODEGEN ON)</code> in their <code class="highlighter-rouge">build/config.cmake</code> to enable the DNNL codegen.</p>
+
+<hr />
+<ul>
+  <li>
+    <p><a href="https://github.com/zhiics">Zhi Chen</a> is a TVM PMC member as well as a senior engineer at SageMaker Neo, Amazon AI, AWS.</p>
+  </li>
+  <li>
+    <p><a href="https://comaniac.github.io">Cody Yu</a> is a TVM reviewer as well as an applied scientist at Amazon AI, AWS.</p>
+  </li>
+</ul>
+
+<h2 id="acknowledgment">Acknowledgment</h2>
+
+<p>We would like to thank our colleague Animesh Jain for valuable discussions in the framework design; Tianqi Chen and Jared Roesch from OctoML for system design discussions and prototyping; Masahiro Masuda from the TVM community to help code review and improve the DNNL integration. We would also like to thank Ramana Radhakrishnan, Matthew Barrett, Manupa Karunaratne, and Luke Hutton from ARM, U.K. for contributing several helpful ideas, related Relay passes, and the Arm Compute Library  [...]
+
+
+    </div>
+  </div>
+</div>
+</div>
+
+
+    
+
+
+
+
+
+    <div class="container">
+
+      <footer class="small">
+        Apache TVM is an effort undergoing incubation at The Apache Software Foundation (ASF),
+        sponsored by the <i>Apache Incubator</i>. Incubation is required
+        of all newly accepted projects until a further review indicates that the infrastructure,
+        communications, and decision making process have stabilized in a manner consistent with other
+        successful ASF projects. While incubation status is not necessarily a reflection of the completeness
+        or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
+
+        Copyright © 2020 The Apache Software Foundation. Apache TVM, Apache,
+        the Apache feather, and the Apache TVM project logo are either trademarks or registered trademarks of the Apache Software Foundation.
+
+        See also other useful <a href="/asf" class="footer-link">ASF links</a>:
+        <a href="https://www.apache.org/" class="footer-link">Apache Homepage</a>,
+        <a href="https://www.apache.org/licenses/" class="footer-link">License</a>
+        <a href="https://www.apache.org/foundation/sponsorship.html" class="footer-link">Sponsorship</a>,
+        <a href="https://www.apache.org/security/" class="footer-link">Security</a>
+        <a href="https://www.apache.org/foundation/thanks.html" class="footer-link">Thanks</a>,
+        <a href="https://www.apache.org/events/current-event.html" class="footer-link">Current Event</a>
+
+      </footer>
+    </div>
+  </body>
+</html>
+
+
diff --git a/atom.xml b/atom.xml
index e105120..92d9ce6 100644
--- a/atom.xml
+++ b/atom.xml
@@ -4,7 +4,7 @@
  <title>TVM</title>
  <link href="https://tvm.apache.org" rel="self"/>
  <link href="https://tvm.apache.org"/>
- <updated>2020-07-14T09:12:02-07:00</updated>
+ <updated>2020-07-15T09:54:04-07:00</updated>
  <id>https://tvm.apache.org</id>
  <author>
    <name></name>
@@ -13,6 +13,485 @@
 
  
  <entry>
+   <title>How to Bring Your Own Codegen to TVM</title>
+   <link href="https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm"/>
+   <updated>2020-07-15T00:00:00-07:00</updated>
+   <id>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</id>
+   <content type="html">&lt;p&gt;To free data scientists from worrying about the performance when developing a new model, hardware backend providers (e.g., Intel, NVIDIA, ARM, etc) either provide kernel libraries such as cuBLAS or cuDNN with many commonly used deep learning kernels, or provide frameworks such as DNNL or TensorRT with a graph engine to let users describe their models in a certain way to achieve high performance. In addition, emerging deep learning accelerators also have t [...]
+
+&lt;p&gt;However, users have to learn a new programming interface when they attempt to work with a new kernel library or device. As a result, the demand for a unified programming interface becomes more and more important to let all users and hardware backend providers stand on the same page.&lt;/p&gt;
+
+&lt;p&gt;To share the programming interface with widely used deep learning frameworks, many hardware device providers have attempted to integrate their device backends into TensorFlow. However, since TensorFlow does not provide an official interface for new backends, you have to hack the TensorFlow source for registration, which involves many source file changes and makes future maintenance difficult.&lt;/p&gt;
+
+&lt;p&gt;In this post, we demonstrate how you, as a hardware backend provider, can easily leverage the Bring Your Own Codegen (BYOC) framework to integrate the kernel library/compiler/framework of your hardware device into TVM. The most important advantage of leveraging the BYOC framework is that &lt;strong&gt;&lt;em&gt;all related source files of your devices are self-contained, so the codegen/runtime of your devices are pluggable into the TVM code base.&lt;/em&gt;&lt;/strong&gt; It means that  [...]
+
+&lt;p&gt;In the rest of this post, we first illustrate a scenario in which you may need TVM with BYOC, followed by an overview of the BYOC compilation and runtime flows. Then, we illustrate step by step how to integrate a vendor library or an execution engine into TVM with BYOC, using Intel DNNL (a.k.a. MKL-DNN, OneDNN) as a running example.&lt;/p&gt;
+
+&lt;h2 id=&quot;bring-an-asic-accelerator-to-tvm&quot;&gt;Bring an ASIC Accelerator to TVM&lt;/h2&gt;
+
+&lt;p&gt;Let’s first sketch a scenario to illustrate why you might want to bring your accelerator to TVM and what features you can expect from the BYOC framework. If you are not sure whether your case is suitable for BYOC, you are welcome to raise a discussion at &lt;a href=&quot;https://discuss.tvm.ai&quot;&gt;discuss.tvm.ai&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;Imagine that you have just built an edge device platform with an ARM CPU and a fantastic accelerator that achieves amazing performance for common image classification models. In other words, your accelerator does well on Conv2D, ReLU, GEMM, and other widely used CNN operators.&lt;/p&gt;
+
+&lt;p&gt;Unfortunately, object detection models are getting more and more popular as well, and your customers need to run both image classification and object detection models on your platform. Although your accelerator is capable of executing almost all operators in object detection models, one operator (e.g., non-maximum suppression, NMS) is missing.&lt;/p&gt;
+
+&lt;h3 id=&quot;let-tvm-execute-unsupported-operators&quot;&gt;Let TVM execute unsupported operators&lt;/h3&gt;
+&lt;p&gt;Since TVM has multiple codegens for different backends, it is easy for the open source community to implement new operators on CPU or GPU in a short time. Ideally, if you integrate the compilation flow of your accelerator into TVM with BYOC, TVM will perform Relay graph partitioning to offload a part of the graph to your accelerator while keeping the rest on TVM. As a result, you can claim that your platform is capable of running all models without worrying about new operators.&lt;/p&gt;
+
+&lt;h3 id=&quot;customize-graph-level-optimization&quot;&gt;Customize graph-level optimization&lt;/h3&gt;
+&lt;p&gt;Your ASIC accelerator must have its own compilation flow. Usually, it falls into one of the following cases:&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Generate a graph representation and feed it to a graph engine&lt;/strong&gt;:
+You may have your own graph engine that is capable of executing a graph (or a neural network model) on your accelerator. For example, both Intel DNNL and NVIDIA TensorRT use an engine to run a whole graph or a model, so that they are able to 1) reduce memory transactions between operators and 2) optimize graph execution with operator fusion.&lt;/p&gt;
+
+&lt;p&gt;In order to achieve the above two optimizations, you may need to process the graph at compile time. For example, Conv2D and bias addition are two separate operators in TVM, but they may be one operator (Conv2D with bias addition capability) on your accelerator. In this case, you may want to optimize the graph by replacing the &lt;code class=&quot;highlighter-rouge&quot;&gt;conv2d - add&lt;/code&gt; graph pattern with a &lt;code class=&quot;highlighter-rouge&quot;&gt;yo [...]
+
+&lt;p&gt;If your compilation flow falls into this case, then we recommend reading all the rest sections in this post but skipping &lt;a href=&quot;#bring-dnnl-to-tvm-c-source-codegen&quot;&gt;Bring DNNL to TVM: C Source Codegen&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Generate assembly code and compile it to an executable binary&lt;/strong&gt;:
+If you do not have an end-to-end execution framework for your platform as in the previous case, you may have a compiler that compiles programs to the assembly code of your ISA. In order to feed the assembly code to your compiler, you will need a codegen to generate and optimize the assembly code from a Relay graph.&lt;/p&gt;
+
+&lt;p&gt;If your compilation flow falls into this case, then we recommend reading all the rest sections in this post but skipping &lt;a href=&quot;#bring-dnnl-to-tvm-json-codegenruntime&quot;&gt;Bring DNNL to TVM: JSON Codegen/Runtime&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h2 id=&quot;how-byoc-works&quot;&gt;How BYOC Works&lt;/h2&gt;
+
+&lt;p&gt;We then briefly explain how the BYOC framework works. For more detailed explanations of the underlying framework components and their implementations, please refer to the &lt;a href=&quot;https://tvm.apache.org/docs/dev/relay_bring_your_own_codegen.html&quot;&gt;developer document&lt;/a&gt;. In short, given a Relay graph in Figure 1, the BYOC framework performs the following steps:&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/bring-your-own-codegen/original_graph.png&quot; alt=&quot;The original Relay graph&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt;
+Figure 1: The Original Relay Graph.
+&lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;1-graph-annotation&quot;&gt;1. Graph Annotation&lt;/h3&gt;
+&lt;p&gt;Taking a user-provided Relay graph, our first step is to annotate the nodes in the graph that can potentially be offloaded to your accelerator. You will need to follow &lt;a href=&quot;#bring-dnnl-to-tvm-annotation-rules&quot;&gt;Bring DNNL to TVM: Annotation Rules&lt;/a&gt; to implement a whitelist of supported operators, or a graph pattern list of customized composite operators. An example annotation result is shown in Figure 2.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/bring-your-own-codegen/after_annotation.png&quot; alt=&quot;The Graph with Annotations&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt;
+Figure 2: The Graph with Annotations.
+&lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;2-graph-transformation&quot;&gt;2. Graph Transformation&lt;/h3&gt;
+&lt;p&gt;The second step is to transform and optimize the graph based on the annotations. Specifically, BYOC performs the following transformations.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;2.1: Merge compiler regions&lt;/strong&gt;: As can be seen in Figure 2, we now have many “regions” in the graph that can be offloaded to your accelerator, but some of them can actually be merged to reduce data transfer and kernel launch overhead. Accordingly, step 2.1 uses a greedy algorithm to merge as many of those regions as possible while guaranteeing functional correctness. The result is depicted in Figure 3.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/bring-your-own-codegen/after_merging_regions.png&quot; alt=&quot;After Merging Compiler Regions&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt;
+Figure 3: After Merging Compiler Regions.
+&lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;2.2: Partition Graph&lt;/strong&gt;: For each region from the previous step, we create a Relay function with an attribute &lt;code class=&quot;highlighter-rouge&quot;&gt;Compiler&lt;/code&gt; to indicate that this Relay function should be entirely offloaded to your accelerator, as shown in Figure 4.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/bring-your-own-codegen/after_partitioning.png&quot; alt=&quot;After Graph Partitioning&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt;
+Figure 4: After Graph Partitioning.
+&lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;3-code-generation&quot;&gt;3. Code Generation&lt;/h3&gt;
+&lt;p&gt;Now we know which part of the Relay graph should be offloaded. In this step, we sequentially send every Relay function with &lt;code class=&quot;highlighter-rouge&quot;&gt;Compiler=your_accelerator&lt;/code&gt; to your codegen. Your codegen should compile the Relay function into a form that matches your own compilation flow. It can be either C source code or any text format.&lt;/p&gt;
+
+&lt;p&gt;Finally, all compiled functions will be serialized along with other non-offloaded Relay functions to a single &lt;code class=&quot;highlighter-rouge&quot;&gt;.so&lt;/code&gt; file by the TVM &lt;code class=&quot;highlighter-rouge&quot;&gt;export_library&lt;/code&gt; Python API. In other words, the user will get only one &lt;code class=&quot;highlighter-rouge&quot;&gt;.so&lt;/code&gt; file after running this flow.&lt;/p&gt;
+
+&lt;h3 id=&quot;4-runtime&quot;&gt;4. Runtime&lt;/h3&gt;
+&lt;p&gt;You may also need to implement a runtime to initialize your graph engine (if applicable) and execute the compiled functions. During inference, the TVM runtime (i.e., graph runtime or VM) will leverage your runtime to invoke the offloaded functions when it encounters the corresponding function call in Figure 4. Your runtime is responsible for launching the compiled function with the given input tensor arrays and filling the results into the output tensor arrays.&lt;/p&gt;
+
+&lt;p&gt;In the rest of this post, we use DNNL as an example to demonstrate how to achieve the above workflow using the BYOC framework. Please note that all code and line numbers referred to in this post are based on the TVM repository’s master branch commit &lt;a href=&quot;https://github.com/apache/incubator-tvm/tree/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8&quot;&gt;8a0249c&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h2 id=&quot;bring-dnnl-to-tvm-annotation-rules&quot;&gt;Bring DNNL to TVM: Annotation Rules&lt;/h2&gt;
+
+&lt;p&gt;The BYOC framework provides two approaches for you to describe the supported operators and patterns. You can use both of them simultaneously. In this section, we use DNNL as an example to show how to make use of them. The complete implementation is available &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/python/tvm/relay/op/contrib/dnnl.py&quot;&gt;here&lt;/a&gt;. Note that we put the annotation rules for your codegen under [...]
+
+&lt;h3 id=&quot;rules-for-single-operators&quot;&gt;Rules for single operators&lt;/h3&gt;
+&lt;p&gt;You can intuitively specify which Relay operators are supported by your accelerator with the BYOC API. For example, we use the following code snippet to build a rule saying that our DNNL codegen supports Conv2D:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;register_op_attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt [...]
+&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_dnnl_conv2d_wrapper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;attrs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;This registers a new attribute &lt;code class=&quot;highlighter-rouge&quot;&gt;target.dnnl&lt;/code&gt; to the Relay &lt;code class=&quot;highlighter-rouge&quot;&gt;nn.conv2d&lt;/code&gt; operator. In this way, the BYOC annotation can invoke &lt;code class=&quot;highlighter-rouge&quot;&gt;target.dnnl()&lt;/code&gt; for every operator in the graph to check whether it is supported by the DNNL codegen.&lt;/p&gt;
+
+&lt;p&gt;On the other hand, it might be tedious to write the above code snippet for every single operator. For the DNNL implementation, we implemented a helper function, &lt;code class=&quot;highlighter-rouge&quot;&gt;_register_external_op_helper&lt;/code&gt;, to make our life easier:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;supported&lt;/span&gt;&lt;span class=&q [...]
+    &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;register_op_attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;target.dnnl [...]
+    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_func_wrapper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;attrs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
+        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;supported&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_func_wrapper&lt;/span&gt;
+
+&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;nn.batch_norm&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;nn.conv2d&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;nn.dense&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;nn.relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;add&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;subtract&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;multiply&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;In the above example, we specify a list of operators that can be supported by the DNNL codegen.&lt;/p&gt;
+
+&lt;h3 id=&quot;rules-for-graph-patterns&quot;&gt;Rules for graph patterns&lt;/h3&gt;
+&lt;p&gt;Your accelerator or compiler may optimize certain patterns (e.g., Conv2D + add + ReLU) into a single instruction or API call. In this case, you can specify a mapping from a graph pattern to your instruction/API. In the case of DNNL, its Conv2D API already includes bias addition and allows the next ReLU to be attached, so we can call DNNL as in the following code snippet (the complete implementation can be found &lt;a href=&quot;https://github.com/apache/incubator-tvm/blo [...]
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;DNNLConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;has_bias&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt [...]
+  &lt;span class=&quot;c1&quot;&gt;// ... skip ...&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv_desc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;convolution_forward&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;desc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prop_kind&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;forward_inference&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;algorithm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;convolution_direct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;conv_src_md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv_weights_md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv_bias_md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv_dst_md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;strides_dims&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;padding_dims_l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;padding_dims_r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+
+  &lt;span class=&quot;c1&quot;&gt;// Attach ReLU&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;primitive_attr&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;has_relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;post_ops&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append_eltwise&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;spa [...]
+    &lt;span class=&quot;n&quot;&gt;attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_post_ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+
+  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv2d_prim_desc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;convolution_forward&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;primitive_desc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;conv_desc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;engine_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+  &lt;span class=&quot;c1&quot;&gt;// ... skip ...&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;In this case, in addition to a single &lt;code class=&quot;highlighter-rouge&quot;&gt;conv2d&lt;/code&gt;, we would like to map the graph pattern &lt;code class=&quot;highlighter-rouge&quot;&gt;conv2d+relu&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;DNNLConv2d(false, true)&lt;/code&gt;, and map &lt;code class=&quot;highlighter-rouge&quot;&gt;conv2d+add+relu&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;DNNLConv2d(true, true)&lt;/code&gt;. We can [...]
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;make_pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&l [...]
+  &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'nn.conv2d'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;conv_out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'add'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;conv_out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'nn.relu'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv_out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+
+&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;register_pattern_table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dnnl&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;pattern_table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;conv2d_bias_relu_pat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dnnl.conv2d_bias_relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;make_pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt [...]
+  &lt;span class=&quot;n&quot;&gt;conv2d_relu_pat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dnnl.conv2d_relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;make_pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span clas [...]
+  &lt;span class=&quot;n&quot;&gt;dnnl_patterns&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv2d_bias_relu_pat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv2d_relu_pat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_patterns&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;In the DNNL example, we implemented two patterns with different names so that we can easily recognize them in the codegen. Note that the patterns are implemented in the Relay pattern language. You can follow &lt;a href=&quot;https://tvm.apache.org/docs/langref/relay_pattern.html&quot;&gt;this tutorial&lt;/a&gt; to learn how to write your own patterns.&lt;/p&gt;
+
+&lt;p&gt;With the pattern table, we can then use a Relay pass to perform the transformation from&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;%1 = nn.conv2d(%data, %weight, ...)
+%2 = add(%1, %bias)
+%3 = nn.relu(%2)
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;to&lt;/p&gt;
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;%1 = fn(%input1, %input2, %input3,
+        Composite=&quot;dnnl.conv2d_bias_relu&quot;,
+        PartitionedFromPattern=&quot;nn.conv2d_add_nn.relu_&quot;) {
+  %1 = nn.conv2d(%input1, %input2, ...)
+  %2 = add(%1, %input3)
+  nn.relu(%2)
+}
+%2 = %1(%data, %weight, %bias)
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;Thus, the DNNL codegen can get the pattern name &lt;code class=&quot;highlighter-rouge&quot;&gt;conv2d_bias_relu&lt;/code&gt; and map &lt;code class=&quot;highlighter-rouge&quot;&gt;%1&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;DNNLConv2d(true, true)&lt;/code&gt;.&lt;/p&gt;
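+
+&lt;p&gt;A minimal Python sketch of this dispatch idea (the &lt;code class=&quot;highlighter-rouge&quot;&gt;make_dnnl_conv2d&lt;/code&gt; helper is hypothetical, standing in for the real C++ &lt;code class=&quot;highlighter-rouge&quot;&gt;DNNLConv2d&lt;/code&gt; call):&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def make_dnnl_conv2d(func, with_bias, with_relu):
+    # Hypothetical stand-in for the C++ DNNLConv2d(with_bias, with_relu).
+    return ('DNNLConv2d', with_bias, with_relu)
+
+def lower_composite(func):
+    # Branch on the Composite attribute attached by MergeComposite.
+    name = str(func.attrs['Composite'])
+    if name == 'dnnl.conv2d_bias_relu':
+        return make_dnnl_conv2d(func, with_bias=True, with_relu=True)
+    if name == 'dnnl.conv2d_relu':
+        return make_dnnl_conv2d(func, with_bias=False, with_relu=True)
+    raise ValueError('Unsupported composite function: ' + name)
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;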
+
+&lt;p&gt;As you may have noticed, we also have an attribute called “PartitionedFromPattern” in the composite function. This could be helpful if your pattern contains &lt;code class=&quot;highlighter-rouge&quot;&gt;wildcard&lt;/code&gt; operators. For example, we may have a pattern table &lt;code class=&quot;highlighter-rouge&quot;&gt;(&quot;conv2d_with_something&quot;, conv2d -&amp;gt; *)&lt;/code&gt;:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;make_pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&l [...]
+  &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'nn.conv2d'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;In this case, you will get a composite function with &lt;code class=&quot;highlighter-rouge&quot;&gt;Composite=conv2d_with_something&lt;/code&gt;, but you have no idea of the graph it actually matched. That’s where PartitionedFromPattern comes into play. You can tell whether the matched graph is &lt;code class=&quot;highlighter-rouge&quot;&gt;conv2d -&amp;gt; add&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;conv2d -&amp;gt; relu&lt;/code&gt; by looking at  [...]
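+
+&lt;p&gt;For instance, a minimal sketch of a hypothetical helper that recovers the matched operators from this attribute:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def matched_ops(func):
+    # PartitionedFromPattern concatenates the matched operator names,
+    # e.g. 'nn.conv2d_add_' for conv2d followed by add. This naive split
+    # would need more care for operator names that contain '_' themselves.
+    pattern = str(func.attrs['PartitionedFromPattern'])
+    return [op for op in pattern.split('_') if op]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;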
+
+&lt;h2 id=&quot;bring-dnnl-to-tvm-relay-graph-transformation&quot;&gt;Bring DNNL to TVM: Relay Graph Transformation&lt;/h2&gt;
+&lt;p&gt;With the annotation rules from the previous step, we can now apply a list of BYOC Relay passes to transform the Relay graph from Figure 1 to Figure 4:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;create_relay_module_from_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Output: Figure 1
+&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MergeComposite&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pattern_table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&qu [...]
+&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnnotateTarget&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dnnl&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;) [...]
+&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MergeCompilerRegions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Output: Figure 3
+&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PartitionGraph&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Output: Figure 4
+&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;As can be seen, each Relay pass can be mapped to a step we have introduced in &lt;a href=&quot;#how-byoc-works&quot;&gt;How BYOC Works&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h2 id=&quot;bring-dnnl-to-tvm-json-codegenruntime&quot;&gt;Bring DNNL to TVM: JSON Codegen/Runtime&lt;/h2&gt;
+&lt;p&gt;Now let’s implement the DNNL codegen that serializes a Relay graph to a JSON representation, and then implement the DNNL JSON runtime to deserialize and execute the graph. &lt;em&gt;Note that if you attempt to implement a codegen to generate C-compatible programs, you may want to directly proceed to the next section.&lt;/em&gt;&lt;/p&gt;
+
+&lt;p&gt;To enable the DNNL JSON codegen/runtime in TVM to work on this example, please make sure DNNL is available on your machine, and build TVM with &lt;code class=&quot;highlighter-rouge&quot;&gt;set(USE_DNNL_CODEGEN ON)&lt;/code&gt; in &lt;code class=&quot;highlighter-rouge&quot;&gt;config.cmake&lt;/code&gt;.&lt;/p&gt;
+
+&lt;p&gt;The DNNL codegen is implemented in &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;src/relay/backend/contrib/dnnl/codegen.cc&lt;/code&gt;&lt;/a&gt;. Since we implemented the DNNL codegen in both forms in this file for illustration purposes, you can focus on the part covered by &lt;code class=&quot;highlighter-rouge&quot;&gt;USE_JS [...]
+
+&lt;p&gt;We first register the codegen with the TVM registration API (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L510&quot;&gt;L510&lt;/a&gt;). This registration makes the TVM compile engine dispatch the Relay function with &lt;code class=&quot;highlighter-rouge&quot;&gt;Compiler=&amp;lt;your codegen&amp;gt;&lt;/code&gt;  to &lt;code class=&quot;highlighter-rouge&quot;&gt;relay.ext.&amp;lt;your  [...]
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;DNNLCompiler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Object [...]
+  &lt;span class=&quot;c1&quot;&gt;// &quot;ref&quot; should be the partitioned Relay function with kCompiler=dnnl.&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;CHECK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IsInstance&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FunctionNode&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Downcast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Function&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt [...]
+
+  &lt;span class=&quot;c1&quot;&gt;// Get the function name as the symbol to match in runtime.&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;func_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GetExtSymbol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;func&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+
+  &lt;span class=&quot;c1&quot;&gt;// Serialize the function to a JSON string (introduced later).&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;DNNLJSONSerializer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;serializer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;func_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;func&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;serializer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;serialize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graph_json&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;serializer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetJSON&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
+
+  &lt;span class=&quot;c1&quot;&gt;// The constant tensor names that have been bound to the module.&lt;/span&gt;
+  &lt;span class=&quot;c1&quot;&gt;// All constant tensors will be serialized along with the JSON graph&lt;/span&gt;
+  &lt;span class=&quot;c1&quot;&gt;// when export_library is invoked.&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;serializer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetParams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
+
+  &lt;span class=&quot;c1&quot;&gt;// The function to create DNNL JSON runtime (introduced later).&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Registry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt;& [...]
+  &lt;span class=&quot;n&quot;&gt;CHECK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nullptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Cannot find JSON runtime module to create&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+
+  &lt;span class=&quot;c1&quot;&gt;// Create a DNNL runtime module that can run the serialized function.&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;func_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graph_json&lt;/span&gt;&l [...]
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;TVM_REGISTER_GLOBAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;relay.ext.dnnl&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_body_typed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLCompiler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Note that &lt;strong&gt;&lt;em&gt;each runtime module is only responsible for one Relay function, meaning that you may have several DNNL runtime modules in a single &lt;code class=&quot;highlighter-rouge&quot;&gt;.so&lt;/code&gt; file.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
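+
+&lt;p&gt;A minimal sketch to observe this (assuming &lt;code class=&quot;highlighter-rouge&quot;&gt;lib&lt;/code&gt; is the library built by &lt;code class=&quot;highlighter-rouge&quot;&gt;relay.build&lt;/code&gt;, as shown later in this post):&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Each partitioned Relay function becomes one imported runtime module.
+for m in lib.imported_modules:
+    print(m.type_key)  # e.g. 'dnnl_json'
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;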
+
+&lt;h3 id=&quot;dnnl-json-serialization&quot;&gt;DNNL JSON Serialization&lt;/h3&gt;
+&lt;p&gt;Next, we implement the DNNL JSON serializer (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L429&quot;&gt;L429&lt;/a&gt;). We derived it from the BYOC JSON codegen (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/codegen_json/codegen_json.h&quot;&gt;src/relay/backend/contrib/codegen_json/codegen_json.h&lt;/ [...]
+
+&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;op:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;kernel&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;name:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;dnnl.conv2d_relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;inputs:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;s [...]
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;attrs:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+    &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;PartitionedFromPattern:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;nn.conv2d_nn.relu_&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+    &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;shape:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt [...]
+  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;The problem is that we still need the Conv2D attributes such as padding and strides at runtime, but the BYOC JSON serializer only attaches the attributes of the composite function instead of the body operators. On the other hand, the customized DNNL JSON serializer attaches the attributes of the first and only Conv2D in the composite function to generate the following JSON node:&lt;/p&gt;
+
+&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;op:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;kernel&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;name:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;dnnl.conv2d_relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;inputs:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;s [...]
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;attrs:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+    &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;shape:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt [...]
+    &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;data_layout:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;NCHW&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+    &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;kernel_layout:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;OIHW&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+    &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;strides:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+    &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;padding:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt [...]
+  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;As can be seen from the DNNL JSON serializer, you can customize the serializer to generate any form of JSON you like, as long as your JSON runtime can interpret it.&lt;/p&gt;
+
+&lt;h3 id=&quot;dnnl-json-runtime&quot;&gt;DNNL JSON Runtime&lt;/h3&gt;
+
+&lt;p&gt;We then implement a DNNL JSON runtime to interpret and execute the serialized JSON graph. We put it under &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl_json_runtime.cc&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;src/runtime/contrib/dnnl/dnnl_json_runtime.cc&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;Again, we first register two APIs to create the runtime so that we can use them anywhere. &lt;code class=&quot;highlighter-rouge&quot;&gt;runtime.DNNLJSONRuntimeCreate&lt;/code&gt; is used in the previous part after serialization, and &lt;code class=&quot;highlighter-rouge&quot;&gt;runtime.module.loadbinary_dnnl_json&lt;/code&gt; is used when loading the &lt;code class=&quot;highlighter-rouge&quot;&gt;.so&lt;/code&gt; back.&lt;/p&gt;
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Create a DNNL JSON runtime to interpret and execute the given JSON graph.&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;DNNLJSONRuntimeCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot; [...]
+                                      &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;const_names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;make_object&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLJSONRuntime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;symbol_name&lt;/span&gt;&lt;span class=&quot;p [...]
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;TVM_REGISTER_GLOBAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;runtime.DNNLJSONRuntimeCreate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_body_typed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLJSONRuntimeCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+
+&lt;span class=&quot;n&quot;&gt;TVM_REGISTER_GLOBAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;runtime.module.loadbinary_dnnl_json&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_body_typed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;JSONRuntimeBase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LoadFromBinary&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLJSONRuntime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;s [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
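+
+&lt;p&gt;As a quick sanity check, the registered creator is also reachable from Python; a minimal sketch:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import tvm
+
+# Look up the creator registered above; per its C++ signature it takes
+# (symbol_name, graph_json, const_names).
+create_fn = tvm.get_global_func('runtime.DNNLJSONRuntimeCreate', allow_missing=True)
+assert create_fn is not None, 'TVM was not built with USE_DNNL_CODEGEN'
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;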
+
+&lt;p&gt;Now we explain the DNNL JSON runtime implementation. The basic class structure is:&lt;/p&gt;
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DNNLJSONRuntime&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSONRuntimeBase&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt;  &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;type_key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;  &lt;span class=&quot;s&quot;&gt;&quot;dnnl_json&quot;&lt;/span&gt;&lt;span class=&quot;p& [...]
+  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Init&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NDArray&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;consts&lt;/span&gt;&lt;span class=&quot;p&q [...]
+    &lt;span class=&quot;c1&quot;&gt;// Initialize the DNNL graph engine.&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;BuildEngine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
+    
+    &lt;span class=&quot;c1&quot;&gt;// Setup constants entries for weights.&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;CHECK_EQ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;consts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;const_idx_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
+      &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;The number of input constants must match the number of required.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;SetupConstants&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;consts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+
+  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
+   &lt;span class=&quot;c1&quot;&gt;// 1. Fill in the input buffers.&lt;/span&gt;
+   &lt;span class=&quot;c1&quot;&gt;// 2. Invoke the engine through interpreting the stream.&lt;/span&gt;
+   &lt;span class=&quot;c1&quot;&gt;// 3. Read and fill output buffers.&lt;/span&gt;
+  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;Init&lt;/code&gt; function is in charge of building the DNNL engine by interpreting the JSON graph string (see &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl_json_runtime.cc#L93&quot;&gt;L93&lt;/a&gt; for &lt;code class=&quot;highlighter-rouge&quot;&gt;BuildEngine&lt;/code&gt;), and filling the constant weights into the corresponding data entry  [...]
+
+&lt;p&gt;Next, the &lt;code class=&quot;highlighter-rouge&quot;&gt;Run&lt;/code&gt; function (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl_json_runtime.cc#L64&quot;&gt;L64&lt;/a&gt;) first writes the input tensors, which may come from user inputs or constant weights, to the corresponding DNNL memory buffers we initialized when building the DNNL engine. It then launches the DNNL engine to execute the JSON g [...]
+
+&lt;p&gt;Since the rest of the DNNL JSON runtime implementation is too DNNL-specific to dive into in this post, we will stop here. We would like to emphasize that while the DNNL JSON runtime is a good reference to start with, your JSON runtime can be fully customized to fit your requirements.&lt;/p&gt;
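+
+&lt;p&gt;Putting the JSON flow together, a minimal usage sketch (assuming TVM was built with &lt;code class=&quot;highlighter-rouge&quot;&gt;set(USE_DNNL_CODEGEN ON)&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;mod&lt;/code&gt; is the partitioned module from the Relay graph transformation step): since the JSON runtime is compiled into TVM itself, no extra compiler options are needed when exporting the library.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import tvm
+from tvm import relay
+
+with tvm.transform.PassContext(opt_level=3):
+    json, lib, param = relay.build(mod, target='llvm')
+
+# The DNNL JSON runtime modules are serialized together with the host code.
+lib.export_library('lib.so')
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;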
+
+&lt;h2 id=&quot;bring-dnnl-to-tvm-c-source-codegen&quot;&gt;Bring DNNL to TVM: C Source Codegen&lt;/h2&gt;
+&lt;p&gt;Now let’s implement the DNNL codegen that generates C source code which invokes DNNL APIs to execute the Relay graph. &lt;em&gt;Note that if you attempt to implement a codegen to generate other graph representations, such as JSON, you may want to read &lt;a href=&quot;#bring-dnnl-to-tvm-json-codegenruntime&quot;&gt;Bring DNNL to TVM: JSON Codegen/Runtime&lt;/a&gt; and skip this section.&lt;/em&gt;&lt;/p&gt;
+
+&lt;p&gt;To enable the DNNL C source codegen in TVM to work on this example, please make sure DNNL is available on your machine, and build TVM with &lt;code class=&quot;highlighter-rouge&quot;&gt;set(USE_DNNL_CODEGEN C_SRC)&lt;/code&gt; in &lt;code class=&quot;highlighter-rouge&quot;&gt;config.cmake&lt;/code&gt;.&lt;/p&gt;
+
+&lt;p&gt;The DNNL codegen is implemented in &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;src/relay/backend/contrib/dnnl/codegen.cc&lt;/code&gt;&lt;/a&gt;. Since we implemented the DNNL codegen in both forms in this file for illustration purposes, you can focus on the part &lt;strong&gt;NOT&lt;/strong&gt; covered by &lt;code class=&quot; [...]
+
+&lt;p&gt;We first register the codegen with the TVM registration API (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L510&quot;&gt;L510&lt;/a&gt;). This registration makes the TVM compile engine dispatch the Relay function with &lt;code class=&quot;highlighter-rouge&quot;&gt;Compiler=&amp;lt;your codegen&amp;gt;&lt;/code&gt;  to &lt;code class=&quot;highlighter-rouge&quot;&gt;relay.ext.&amp;lt;your  [...]
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;DNNLCompiler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Object [...]
+  &lt;span class=&quot;n&quot;&gt;DNNLModuleCodegen&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CreateCSourceModule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;TVM_REGISTER_GLOBAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;relay.ext.dnnl&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_body_typed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLCompiler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Note that &lt;strong&gt;&lt;em&gt;each runtime module is only responsible for one Relay function, meaning that you may have several DNNL runtime modules in a single &lt;code class=&quot;highlighter-rouge&quot;&gt;.so&lt;/code&gt; file.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;Then, we derive &lt;code class=&quot;highlighter-rouge&quot;&gt;CSourceModuleCodegenBase&lt;/code&gt; to implement  &lt;code class=&quot;highlighter-rouge&quot;&gt;DNNLModuleCodegen&lt;/code&gt; in &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L362&quot;&gt;L362&lt;/a&gt;. While &lt;code class=&quot;highlighter-rouge&quot;&gt;CSourceModuleCodegenBase&lt;/code&gt; is in charge of ot [...]
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CreateCSourceModule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt; [...]
+    &lt;span class=&quot;c1&quot;&gt;// Include headers&lt;/span&gt;
+    &lt;span class=&quot;c1&quot;&gt;// ...skip...&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;code_stream_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;#include &amp;lt;dnnl/dnnl_kernel.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+    &lt;span class=&quot;c1&quot;&gt;// ...skip...&lt;/span&gt;
+
+    &lt;span class=&quot;c1&quot;&gt;// &quot;ref&quot; should be the partitioned Relay function with kCompiler=dnnl.&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;CHECK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IsInstance&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FunctionNode&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GenDNNLFunc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Downcast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Function&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot; [...]
+
+    &lt;span class=&quot;c1&quot;&gt;// &quot;code&quot; is the generated C code with DNNL APIs.&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;code&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;code_stream_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
+
+    &lt;span class=&quot;c1&quot;&gt;// &quot;res&quot; is a tuple of constant weights (symbols, values).&lt;/span&gt;
+    &lt;span class=&quot;c1&quot;&gt;// All constant tensors will be serialized along with the generated C code&lt;/span&gt;
+    &lt;span class=&quot;c1&quot;&gt;// when export_library is invoked.&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sym&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&g [...]
+    &lt;span class=&quot;n&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;variables&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&am [...]
+
+    &lt;span class=&quot;c1&quot;&gt;// Create a CSource module with all above artifacts.&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Registry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt [...]
+    &lt;span class=&quot;n&quot;&gt;CHECK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nullptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Cannot find csource module to create the external runtime module&quot;&lt;/span&gt;&lt;span class [...]
+    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;code&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;c&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sym&lt;/span&gt;& [...]
+  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Next, we implement &lt;code class=&quot;highlighter-rouge&quot;&gt;GenDNNLFunc&lt;/code&gt; (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L365&quot;&gt;L365&lt;/a&gt;) to generate compilable C code with DNNL APIs as follows. Please see the embedded comments for explanations of the TVM C source runtime module compatible function interfaces.&lt;/p&gt;
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// The example Relay graph: conv2d -&amp;gt; add -&amp;gt; relu.&lt;/span&gt;
+&lt;span class=&quot;cp&quot;&gt;#include &amp;lt;cstdint&amp;gt;
+#include &amp;lt;cstdlib&amp;gt;
+#include &amp;lt;cstring&amp;gt;
+#include &amp;lt;vector&amp;gt;
+#include &amp;lt;tvm/runtime/c_runtime_api.h&amp;gt;
+#include &amp;lt;tvm/runtime/container.h&amp;gt;
+#include &amp;lt;tvm/runtime/packed_func.h&amp;gt;
+#include &amp;lt;dlpack/dlpack.h&amp;gt;
+#include &amp;lt;dnnl/dnnl_kernel.h&amp;gt;
+&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;namespace&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;namespace&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;contrib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+
+&lt;span class=&quot;c1&quot;&gt;// Execute the conv2d-&amp;gt;add-&amp;gt;relu graph with DNNL.&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;dnnl_0_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_i0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt [...]
+                        &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_i2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
+  &lt;span class=&quot;c1&quot;&gt;// Allocate intermediate buffers.&lt;/span&gt;
+  &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span c [...]
+  &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span c [...]
+  &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span c [...]
+
+  &lt;span class=&quot;c1&quot;&gt;// Pre-implemented op-based DNNL functions.&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;dnnl_conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dnnl_0_i0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_i1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span [...]
+  &lt;span class=&quot;n&quot;&gt;dnnl_add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_i2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &l [...]
+  &lt;span class=&quot;n&quot;&gt;dnnl_relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;spa [...]
+
+  &lt;span class=&quot;c1&quot;&gt;// Copy the final output to the corresponding buffer.&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memcpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span c [...]
+  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+
+&lt;span class=&quot;c1&quot;&gt;// The wrapper function with all arguments in DLTensor type.&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;dnnl_0_wrapper_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DLTensor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+        &lt;span class=&quot;n&quot;&gt;DLTensor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+        &lt;span class=&quot;n&quot;&gt;DLTensor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+        &lt;span class=&quot;n&quot;&gt;DLTensor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
+
+  &lt;span class=&quot;c1&quot;&gt;// Cast all DLTensor to primitive type buffers and invoke the above&lt;/span&gt;
+  &lt;span class=&quot;c1&quot;&gt;// execution function.&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;dnnl_0_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;& [...]
+  &lt;span class=&quot;n&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+
+&lt;span class=&quot;c1&quot;&gt;// The TVM macro to generate TVM runtime compatible function &quot;dnnl_0&quot;&lt;/span&gt;
+&lt;span class=&quot;c1&quot;&gt;// from our generated &quot;dnnl_0_wrapper_&quot;.&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;TVM_DLL_EXPORT_TYPED_FUNC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dnnl_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_wrapper_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Note that the pre-implemented op-based DNNL functions are in &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl.cc&quot;&gt;src/runtime/contrib/dnnl/dnnl.cc&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;Since the rest of the implementation in &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;src/relay/backend/contrib/dnnl/codegen.cc&lt;/code&gt;&lt;/a&gt; is too DNNL-specific to dive into in this post, we will stop here. The main idea is to implement a Relay graph visitor (&lt;a href=&quot;https://github.com/apache/incubat [...]
+
+&lt;h3 id=&quot;c-source-compilation&quot;&gt;C Source Compilation&lt;/h3&gt;
+&lt;p&gt;As you may have noticed, the output of &lt;code class=&quot;highlighter-rouge&quot;&gt;DNNLCompiler&lt;/code&gt; is a module with the generated C code in text format, which has not yet been compiled by &lt;code class=&quot;highlighter-rouge&quot;&gt;gcc&lt;/code&gt; into an executable binary. In fact, the generated C code is compiled when users call &lt;code class=&quot;highlighter-rouge&quot;&gt;export_library(mod)&lt;/code&gt;, as in the following code snippet:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;update_lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
+    &lt;span class=&quot;c1&quot;&gt;# Include the path of src/runtime/contrib/dnnl/dnnl.cc
+&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;test_dir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dirname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/spa [...]
+    &lt;span class=&quot;n&quot;&gt;source_dir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;test_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &l [...]
+    &lt;span class=&quot;n&quot;&gt;contrib_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt [...]
+
+    &lt;span class=&quot;c1&quot;&gt;# Setup the gcc flag to compile DNNL code.
+&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;options&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;-O2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;-std=c++14&quot;&lt;/span&gt;&lt;span clas [...]
+    &lt;span class=&quot;n&quot;&gt;tmp_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;util&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tempdir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;lib_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'lib.so'&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;lib_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp_path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relpath&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+
+    &lt;span class=&quot;c1&quot;&gt;# The generated C code with DNNL APIs is compiled to a binary lib.so.
+&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;export_library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fcompile&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quo [...]
+
+    &lt;span class=&quot;c1&quot;&gt;# Load the lib.so back to a runtime module.
+&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;
+
+&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PassContext&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;opt_level&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt [...]
+    &lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;spa [...]
+&lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;update_lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;rt_mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;contrib&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;graph_runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;h2 id=&quot;bring-dnnl-to-tvm-build-tvm-with-dnnl-codegenruntime&quot;&gt;Bring DNNL to TVM: Build TVM with DNNL Codegen/Runtime&lt;/h2&gt;
+&lt;p&gt;Finally, we create &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/cmake/modules/contrib/DNNL.cmake&quot;&gt;cmake/modules/contrib/DNNL.cmake&lt;/a&gt; to include the DNNL codegen when building TVM. For demonstration purposes, our DNNL codegen has two implementations in the same cmake file; you can focus on either of them based on your needs.&lt;/p&gt;
+
+&lt;p&gt;With the cmake file ready, users can now specify &lt;code class=&quot;highlighter-rouge&quot;&gt;set(USE_DNNL_CODEGEN ON)&lt;/code&gt; in their &lt;code class=&quot;highlighter-rouge&quot;&gt;build/config.cmake&lt;/code&gt; to enable the DNNL codegen.&lt;/p&gt;
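+
+&lt;p&gt;As a quick sanity check (a sketch of ours, not part of the original flow), you can verify from Python that the codegen was built in: the BYOC infrastructure registers every external codegen as a global function named &lt;code class=&quot;highlighter-rouge&quot;&gt;relay.ext.&amp;lt;compiler&amp;gt;&lt;/code&gt;, so the DNNL one should be discoverable by name:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import tvm
+
+# If USE_DNNL_CODEGEN was ON, the codegen is registered under 'relay.ext.dnnl'.
+# allow_missing=True returns None instead of raising when it is absent.
+func = tvm.get_global_func('relay.ext.dnnl', allow_missing=True)
+print('DNNL codegen enabled:', func is not None)
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;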
+
+&lt;hr /&gt;
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;a href=&quot;https://github.com/zhiics&quot;&gt;Zhi Chen&lt;/a&gt; is a TVM PMC member as well as a senior engineer at SageMaker Neo, Amazon AI, AWS.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;a href=&quot;https://comaniac.github.io&quot;&gt;Cody Yu&lt;/a&gt; is a TVM reviewer as well as an applied scientist at Amazon AI, AWS.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;acknowledgment&quot;&gt;Acknowledgment&lt;/h2&gt;
+
+&lt;p&gt;We would like to thank our colleague Animesh Jain for valuable discussions on the framework design; Tianqi Chen and Jared Roesch from OctoML for system design discussions and prototyping; and Masahiro Masuda from the TVM community for helping review and improve the DNNL integration. We would also like to thank Ramana Radhakrishnan, Matthew Barrett, Manupa Karunaratne, and Luke Hutton from ARM, U.K. for contributing several helpful ideas, related Relay passes, and the Arm Compute Li [...]
+
+</content>
+ </entry>
+ 
+ <entry>
    <title>Bridging PyTorch and TVM</title>
    <link href="https://tvm.apache.org/2020/07/14/bert-pytorch-tvm"/>
    <updated>2020-07-14T00:00:00-07:00</updated>
@@ -3901,13 +4380,13 @@ We are starting to look at performance optimization and we expect more improveme
 &lt;p&gt;You should see something like this:&lt;/p&gt;
 
 &lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-llvm&quot; data-lang=&quot;llvm&quot;&gt;&lt;span class=&quot;c1&quot;&gt;; ModuleID = 'myadd__kernel0'&lt;/span&gt;
-&lt;span class=&quot;err&quot;&gt;sour&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;e_filename&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;myadd__kernel0&quot;&lt;/span&gt;
+&lt;span class=&quot;err&quot;&gt;source_filename&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;myadd__kernel0&quot;&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;target&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;datalayout&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;e-p:32:32-p1:64:64-p2:64:64-p3:32:32-p4:64:64-p5:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64&quot;&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;target&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;triple&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;amdgcn-amd-amdhsa-hcc&quot;&lt;/span&gt;
 
 
 &lt;span class=&quot;c1&quot;&gt;; Function Attrs: nounwind&lt;/span&gt;
-&lt;span class=&quot;k&quot;&gt;define&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;dllexport&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;amdgpu_ker&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;ne&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;l&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;vg&quot;&gt;@myadd__kernel0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k [...]
+&lt;span class=&quot;k&quot;&gt;define&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;dllexport&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;amdgpu_kernel&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;vg&quot;&gt;@myadd__kernel0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;addrspace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class [...]
 &lt;span class=&quot;nl&quot;&gt;entry:&lt;/span&gt;
   &lt;span class=&quot;nv&quot;&gt;%4&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;tail&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;vg&quot;&gt;@llvm.amdgcn.workgroup.id.x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
   &lt;span class=&quot;nv&quot;&gt;%5&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;tail&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;vg&quot;&gt;@llvm.amdgcn.workitem.id.x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
@@ -3927,14 +4406,14 @@ We are starting to look at performance optimization and we expect more improveme
   &lt;span class=&quot;nv&quot;&gt;%10&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;nsw&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%.pre-phi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%5&lt;/span&gt;
   &lt;span class=&quot;nv&quot;&gt;%11&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;nsw&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%.pre-phi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%5&lt;/span&gt;
   &lt;span class=&quot;nv&quot;&gt;%12&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sext&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%11&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i64&lt;/span&gt;
-  &lt;span class=&quot;nv&quot;&gt;%13&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;getelementptr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;inbounds&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;rspa&lt;/span&gt;&lt;span class=&quot;k&quot;&gt [...]
-  &lt;span class=&quot;nv&quot;&gt;%14&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;load&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;rspa&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;e&lt;/span&gt; [...]
-  &lt;span class=&quot;nv&quot;&gt;%15&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;getelementptr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;inbounds&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;rspa&lt;/span&gt;&lt;span class=&quot;k&quot;&gt [...]
-  &lt;span class=&quot;nv&quot;&gt;%16&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;load&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;rspa&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;e&lt;/span&gt; [...]
+  &lt;span class=&quot;nv&quot;&gt;%13&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;getelementptr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;inbounds&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;addrspace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&g [...]
+  &lt;span class=&quot;nv&quot;&gt;%14&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;load&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;addrspace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)*&lt;/span&gt; [...]
+  &lt;span class=&quot;nv&quot;&gt;%15&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;getelementptr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;inbounds&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;addrspace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&g [...]
+  &lt;span class=&quot;nv&quot;&gt;%16&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;load&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;addrspace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)*&lt;/span&gt; [...]
   &lt;span class=&quot;nv&quot;&gt;%17&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fadd&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%16&lt;/span&gt;
   &lt;span class=&quot;nv&quot;&gt;%18&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sext&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%10&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i64&lt;/span&gt;
-  &lt;span class=&quot;nv&quot;&gt;%19&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;getelementptr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;inbounds&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;rspa&lt;/span&gt;&lt;span class=&quot;k&quot;&gt [...]
-  &lt;span class=&quot;k&quot;&gt;store&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%17&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;rspa&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; [...]
+  &lt;span class=&quot;nv&quot;&gt;%19&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;getelementptr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;inbounds&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;addrspace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&g [...]
+  &lt;span class=&quot;k&quot;&gt;store&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%17&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;addrspace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)*&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%19&lt;/span [...]
   &lt;span class=&quot;k&quot;&gt;br&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;label&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%if_end&lt;/span&gt;
 
 
@@ -4101,584 +4580,5 @@ We also learns from Halide when implementing the lowering pipeline in TVM.&lt;/l
 </content>
  </entry>
  
- <entry>
-   <title>Optimize Deep Learning GPU Operators with TVM: A Depthwise Convolution Example</title>
-   <link href="https://tvm.apache.org/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example"/>
-   <updated>2017-08-22T00:00:00-07:00</updated>
-   <id>https://tvm.apache.org/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example</id>
-   <content type="html">&lt;p&gt;Efficient deep learning operators are at the core of deep learning systems.
-Usually these operators are hard to optimize and require great effort from HPC experts.
-&lt;a href=&quot;https://github.com/dmlc/tvm&quot;&gt;TVM&lt;/a&gt;, an end-to-end tensor IR/DSL stack, makes this much easier.&lt;/p&gt;
-
-&lt;p&gt;This blog teaches you how to write high-performance GPU operator kernels with the help of TVM.
-We use depthwise convolution (i.e. &lt;a href=&quot;http://docs.tvmlang.org/api/python/topi.html#topi.nn.depthwise_conv2d_nchw&quot;&gt;topi.nn.depthwise_conv2d_nchw&lt;/a&gt;) as an example,
-and demonstrate how we can improve over the already hand-optimized CUDA kernel in tensorflow.
-Our final version is 2x-4x faster than the optimized kernel in tf-1.2 under different workloads, and 3x-7x faster with operator fusion enabled.
-Below is the result tested on GTX1080, with filter size = [1, 256, 3, 3], stride = [1, 1], padding = ‘SAME’:&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/depthconv_tutorial/tf_compare.png&quot; alt=&quot;image&quot; width=&quot;95%&quot; /&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;introduction-to-depthwise-convolution&quot;&gt;Introduction to Depthwise Convolution&lt;/h2&gt;
-
-&lt;p&gt;Depthwise convolution is an important building block of modern architectures, such as Xception [1] and MobileNet [2].
-It’s an effective method to reduce the computation complexity of deep neural networks.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/depthconv_tutorial/conv_and_depthconv.png&quot; alt=&quot;image&quot; width=&quot;80%&quot; /&gt;&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;source: &lt;a href=&quot;http://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/&quot;&gt;http://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/&lt;/a&gt;&lt;/p&gt;
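-
-&lt;p&gt;To see where the savings come from, here is a back-of-the-envelope sketch (our own illustration with assumed shapes, not part of the original tutorial) comparing the multiply-adds of a standard convolution against a depthwise one with channel multiplier 1:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;H = W = 32  # spatial size of the output (assumed)
-C = 256     # input channels (= output channels here)
-K = 3       # filter size
-
-# Standard conv: every output channel reads every input channel.
-standard = H * W * C * C * K * K
-# Depthwise conv: each channel is convolved with its own KxK filter.
-depthwise = H * W * C * K * K
-
-print(standard // depthwise)  # 256, i.e. a C-fold reduction in multiply-adds
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;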
-
-&lt;p&gt;In TVM, depthwise convolution can be declared as:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# padding stage
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PaddedInput&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
-    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;in_channel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;height_after_pad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;width_after_pad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
-    &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=& [...]
-        &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;all&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_top&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;  [...]
-        &lt;span class=&quot;n&quot;&gt;Input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_top&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span  [...]
-    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PaddedInput&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-&lt;span class=&quot;c1&quot;&gt;# depthconv stage
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;di&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reduce_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;filter_height&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),& [...]
-&lt;span class=&quot;n&quot;&gt;dj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reduce_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;filter_width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; & [...]
-&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
-    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out_channel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out_height&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out_width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
-    &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=& [...]
-        &lt;span class=&quot;n&quot;&gt;PaddedInput&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;channel_multiplier&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/spa [...]
-        &lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;di&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]),&lt;/span&gt;
-    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'DepthwiseConv2d'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;h2 id=&quot;general-gpu-optimization-guidelines&quot;&gt;General GPU Optimization Guidelines&lt;/h2&gt;
-
-&lt;p&gt;This part briefly covers three concepts we should know when optimizing CUDA code: data reuse, shared memory, and bank conflicts.
-If you already know them, you may skip this part.&lt;/p&gt;
-
-&lt;h3 id=&quot;data-reuse&quot;&gt;Data Reuse&lt;/h3&gt;
-&lt;p&gt;In modern computing architectures, the cost of loading data from memory is much higher than that of a single floating-point computation [3].
-Because of that, we always want to reuse the input data after it is loaded into registers or shared memory (cache).&lt;/p&gt;
-
-&lt;p&gt;There are two forms of data reuse in depthwise convolution: filter reuse and input reuse. Filter reuse happens as the filter slides over the input channel and is applied multiple times.
-Input reuse is realized through tiling; let’s take 3x3 depthwise conv as an example:&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/depthconv_tutorial/no_tiling.png&quot; alt=&quot;image&quot; width=&quot;70%&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Without tiling, each thread computes 1 output element and loads 3x3 input data. 16 threads together have 9x16 loads.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/depthconv_tutorial/tiling.png&quot; alt=&quot;image&quot; width=&quot;70%&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;With tiling, each thread computes 2x2 output elements and loads 4x4 input data. 4 threads together have 16x4 loads.&lt;/p&gt;
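-
-&lt;p&gt;The load counts in the two figures can be reproduced with a few lines of Python (a sketch of ours, assuming a 4x4 output region and a 3x3 filter):&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def total_loads(out_h, out_w, tile_h, tile_w, k=3):
-    # Each thread computes one tile_h x tile_w output tile, and therefore
-    # loads a (tile_h + k - 1) x (tile_w + k - 1) input patch.
-    threads = (out_h // tile_h) * (out_w // tile_w)
-    per_thread = (tile_h + k - 1) * (tile_w + k - 1)
-    return threads * per_thread
-
-print(total_loads(4, 4, 1, 1))  # no tiling: 16 threads x 9 loads = 144
-print(total_loads(4, 4, 2, 2))  # 2x2 tiles:  4 threads x 16 loads = 64
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;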
-
-&lt;h3 id=&quot;shared-memory-and-bank-conflicts&quot;&gt;Shared Memory and Bank Conflicts&lt;/h3&gt;
-&lt;p&gt;Shared memory can be seen as a cache in the GPU. It is on-chip and much faster than global memory.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/depthconv_tutorial/GPU_memory_hierarchy.png&quot; alt=&quot;image&quot; width=&quot;256px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Shared memory is allocated per block. It’s common practice to load data from global memory into shared memory, and then all threads in the block read data from shared memory.&lt;/p&gt;
-
-&lt;p&gt;The size of shared memory is limited (usually 48K), so we must be cautious of shared memory overflow.
-Besides, too much shared memory allocated to one block limits the number of active blocks per multiprocessor.&lt;/p&gt;
-
-&lt;p&gt;Another performance issue with shared memory is bank conflicts. Shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously;
-however, if multiple threads access the same memory bank (causing bank conflicts), the accesses are serialized, thus decreasing the effective bandwidth.&lt;/p&gt;
-
-&lt;p&gt;Shared memory banks are organized such that successive addresses are assigned to successive banks.
-To avoid bank conflicts, it’s better that successive threads access successive memory addresses, as illustrated below (each color represents one shared memory bank):&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/depthconv_tutorial/bank_conflicts.png&quot; alt=&quot;image&quot; width=&quot;95%&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;For more details on shared memory and bank conflicts, please refer to &lt;a href=&quot;https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/&quot;&gt;this Nvidia’s blog&lt;/a&gt;.&lt;/p&gt;
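-
-&lt;p&gt;A minimal sketch (ours, assuming the usual NVIDIA layout of 32 banks with 4-byte words) makes the two access patterns concrete:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;NUM_BANKS = 32
-WORD = 4  # bytes per bank word; one float32 each
-
-def bank(byte_addr):
-    return (byte_addr // WORD) % NUM_BANKS
-
-# Successive threads reading successive floats touch 32 distinct banks.
-print(len({bank(WORD * t) for t in range(32)}))       # 32: conflict-free
-# Threads striding by 32 floats all land in the same bank.
-print(len({bank(WORD * 32 * t) for t in range(32)}))  # 1: 32-way conflict
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;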
-
-&lt;p&gt;Ok, now let’s start optimizing depthwise convolution in TVM.&lt;/p&gt;
-
-&lt;h2 id=&quot;schedule-optimization&quot;&gt;Schedule Optimization&lt;/h2&gt;
-
-&lt;h3 id=&quot;compute-paddedinput-inline-to-save-memory-allocation&quot;&gt;Compute PaddedInput Inline to Save Memory Allocation&lt;/h3&gt;
-&lt;p&gt;As we see from part 1, padding is declared explicitly as a separate stage. We compute it inline to avoid redundant memory allocation:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create_schedule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/sp [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PaddedInput&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compute_inline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;h3 id=&quot;divide-one-large-channel-into-smaller-blocks&quot;&gt;Divide One Large Channel into Smaller Blocks&lt;/h3&gt;
-&lt;p&gt;One straightforward schedule for depthwise convolution is that one cuda block takes care of one input channel and its corresponding filters, loading them into shared memory and then computing:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;IS&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cache_read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PaddedInput&lt;/spa [...]
-&lt;span class=&quot;n&quot;&gt;FS&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cache_read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;shared&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span& [...]
-&lt;span class=&quot;n&quot;&gt;block_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;blockIdx.y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;block_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;blockIdx.x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-&lt;span class=&quot;c1&quot;&gt;# bind the dimension of batch (N in NCHW) with block_y
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op&lt;/span&gt;&lt;s [...]
-&lt;span class=&quot;c1&quot;&gt;# bind the dimension of channel (C in NCHW) with block_x
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op&lt;/span&gt;&lt;s [...]
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;We test the average time cost of 1000 runs on GTX 1080, and compare with &lt;a href=&quot;https://www.tensorflow.org/versions/r0.12/api_docs/python/nn/convolution#depthwise_conv2d&quot;&gt;depthwise_conv2d in tensorflow&lt;/a&gt;.
-Here is the result:&lt;/p&gt;
-
-&lt;table&gt;
-  &lt;thead&gt;
-    &lt;tr&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Input&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Filter&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;stride&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;tf-1.2 SAME pad (us)&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;TVM SAME pad (us)&lt;/th&gt;
-    &lt;/tr&gt;
-  &lt;/thead&gt;
-  &lt;tbody&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 21, 21]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 1, 3, 3]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;16.1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;9.1&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 32, 32]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 1, 3, 3]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;34.8&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;14.5&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 64, 64]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 1, 3, 3]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;130.9&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;98.9&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 1, 3, 3]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;251.6&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;387.4&lt;/td&gt;
-    &lt;/tr&gt;
-  &lt;/tbody&gt;
-&lt;/table&gt;
-
-&lt;p&gt;As we can see, this schedule performs well with a small channel size like 21 x 21 or 32 x 32; however, its performance drops severely once the channel size grows to 64 x 64 or above.
-One main reason is that too much shared memory allocated to one block limits the number of active blocks per multiprocessor.&lt;/p&gt;
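-
-&lt;p&gt;Some quick arithmetic (our own estimate, assuming float32 data, one channel of shared memory per block, and 48K of shared memory per multiprocessor) shows why:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SMEM_PER_SM = 48 * 1024  # bytes of shared memory per multiprocessor
-
-for hw in (32, 64, 96):
-    channel_bytes = hw * hw * 4  # one float32 channel held in shared memory
-    print(hw, SMEM_PER_SM // channel_bytes)  # 32: 12 blocks, 64: 3, 96: 1
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;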
-
-&lt;p&gt;We modify the schedule to divide one large channel into smaller blocks. For example, one channel (64 x 64 or 96 x 96) is divided into blocks of 32 x 32,
-and one cuda block takes care of one 32 x 32 block:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;blocking_h&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;blocking_w&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;
-&lt;span class=&quot;c1&quot;&gt;# split the dimension of height (H in NCHW)
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bx1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;s [...]
-&lt;span class=&quot;c1&quot;&gt;# split the dimension of width (W in NCHW)
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bx2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;s [...]
-&lt;span class=&quot;c1&quot;&gt;# assign one 32 x 32 block to one cuda block
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fuse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;block_y&lt;/span&gt;&lt;span class=& [...]
-&lt;span class=&quot;n&quot;&gt;bx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fuse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bx1&lt;/span&gt;&lt;span class=&quo [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;block_x&lt;/span&gt;&lt;span class=& [...]
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;Here is the new result:&lt;/p&gt;
-
-&lt;table&gt;
-  &lt;thead&gt;
-    &lt;tr&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Input&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;[blocking_h, blocking_w]&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;tf-1.2 SAME pad (us)&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;TVM SAME pad (us)&lt;/th&gt;
-    &lt;/tr&gt;
-  &lt;/thead&gt;
-  &lt;tbody&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 64, 64]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[32, 32]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;130.9&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;63.4&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[32, 32]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;251.6&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;132.5&lt;/td&gt;
-    &lt;/tr&gt;
-  &lt;/tbody&gt;
-&lt;/table&gt;
-
-&lt;p&gt;Our blocking strategy works! For 64 x 64 channel size, it brings 1.6x acceleration (98.9us -&amp;gt; 63.4us); for 96 x 96 channel size, it brings 2.9x acceleration (387.4us -&amp;gt; 132.5us).&lt;/p&gt;
-
-&lt;h3 id=&quot;tuning-parameters-of-thread-numbers&quot;&gt;Tuning Parameters of Thread Numbers&lt;/h3&gt;
-
-&lt;p&gt;How should we schedule the workload, say 32x32, among the threads of one cuda block? Intuitively, it should be like this:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;thread_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span [...]
-&lt;span class=&quot;n&quot;&gt;thread_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span [...]
-&lt;span class=&quot;n&quot;&gt;ty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&qu [...]
-&lt;span class=&quot;n&quot;&gt;tx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&qu [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reorder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tx&lt;/span&gt;&lt;span class=&qu [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread_y&lt;/span&gt;&lt;span class= [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread_x&lt;/span&gt;&lt;span class= [...]
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;There are two parameters in the schedule: &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_y&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_x&lt;/code&gt;. How do we determine their optimal combination?
-Well, let’s first do some experiments. Below is the result with Filter = [256, 1, 3, 3] and stride = [1, 1]:&lt;/p&gt;
-
-&lt;table&gt;
-  &lt;thead&gt;
-    &lt;tr&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Case&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Input&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;num_thread_y&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;num_thread_x&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;TVM SAME pad (us)&lt;/th&gt;
-    &lt;/tr&gt;
-  &lt;/thead&gt;
-  &lt;tbody&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 32, 32]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;8&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;32&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;9.7&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;2&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 32, 32]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;4&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;32&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;8.8&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;3&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 32, 32]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;32&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;17.7&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;4&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 32, 32]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;32&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;32.5&lt;/td&gt;
-    &lt;/tr&gt;
-  &lt;/tbody&gt;
-&lt;/table&gt;
-
-&lt;p&gt;There are many interesting observations in the above results:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;
-    &lt;p&gt;Case 2 is faster than case 1. In case 2, each thread computes an 8x1 tile in the output, which corresponds to a 10x3 tile in the input.
-It has better data reuse than case 1’s 4x1 tile.&lt;/p&gt;
-  &lt;/li&gt;
-  &lt;li&gt;
-    &lt;p&gt;Case 3 is slower than case 2, because in case 3 the workload per thread is too large and incurs a high cost of local memory reads.&lt;/p&gt;
-  &lt;/li&gt;
-  &lt;li&gt;
-    &lt;p&gt;Case 4 is slower than case 3. It’s because &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_x = 32&lt;/code&gt; ensures no bank conflicts, while &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_y = 32&lt;/code&gt; doesn’t.&lt;/p&gt;
-  &lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;To summarize what we learn from the above observations:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;A large tile is good for data reuse but not for local memory read.&lt;/li&gt;
-  &lt;li&gt;The influence of &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_y&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_x&lt;/code&gt; on bank conflicts is asymmetric.&lt;/li&gt;
-  &lt;li&gt;Finding the optimal combination of &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_y&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_x&lt;/code&gt; means striking a balance among efficient shared memory access (avoiding bank conflicts), data reuse, and local memory read.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;Pretty tricky. So, what exactly should we do to find the optimal combination? The answer is brute-force search.
-We can pass &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_y&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_x&lt;/code&gt; as arguments to the schedule function, and try all possible combinations to find the optimal one. This can be done easily in TVM:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;schedule_depthwise_conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;&lt;span class=&quot; [...]
-    &lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;
-    &lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt;
-    &lt;span class=&quot;n&quot;&gt;do_schedule_as_usual&lt;/span&gt;
-    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;schedule&lt;/span&gt;
-
-&lt;span class=&quot;n&quot;&gt;min_time_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inf&lt;/span&gt;
-&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;all_possible_combinations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
-    &lt;span class=&quot;n&quot;&gt;schedule&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;schedule_depthwise_conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;&lt;span class=& [...]
-    &lt;span class=&quot;n&quot;&gt;time_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;test_depthwise_conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;schedule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min_time_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
-        &lt;span class=&quot;n&quot;&gt;min_time_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time_cost&lt;/span&gt;
-        &lt;span class=&quot;n&quot;&gt;optimal_combination&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;In fact, it can be seen as a simple auto scheduler.&lt;/p&gt;
-
-&lt;h3 id=&quot;vthread-and-strided-patterns&quot;&gt;Vthread and Strided Patterns&lt;/h3&gt;
-&lt;p&gt;Vthread (virtual thread) in TVM is introduced to support strided patterns. We can use it this way:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;num_vthread_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;num_vthread_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;thread_vy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_vthread_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/sp [...]
-&lt;span class=&quot;n&quot;&gt;thread_vx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_vthread_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/sp [...]
-&lt;span class=&quot;n&quot;&gt;thread_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span [...]
-&lt;span class=&quot;n&quot;&gt;thread_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span [...]
-&lt;span class=&quot;c1&quot;&gt;# split the dimension of height (H in NCHW) twice
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vyi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt [...]
-&lt;span class=&quot;n&quot;&gt;ty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&qu [...]
-&lt;span class=&quot;c1&quot;&gt;# split the dimension of width (W in NCHW) twice
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vxi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt [...]
-&lt;span class=&quot;n&quot;&gt;tx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&qu [...]
-&lt;span class=&quot;c1&quot;&gt;# bind thread and vthread respectively
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread_vy&lt;/span&gt; [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread_vx&lt;/span&gt;&lt;span clas [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread_y&lt;/span&gt;&lt;span class= [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread_x&lt;/span&gt;&lt;span class= [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reorder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvx&lt;/span&gt;&lt;span class=& [...]
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;Let’s print the IR to see what vthread does:&lt;/p&gt;
-
-&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cm&quot;&gt;/* Input = [1, 1, 32, 32], Filter = [1, 1, 3, 3], stride = [1, 1], padding = 'SAME' */&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;produce&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(blockIdx.y, , blockIdx.y)] thread_extent = 1&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(threadIdx.y, Range(min=0, extent=8), threadIdx.y)] thread_extent = 8&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(threadIdx.x, Range(min=0, extent=8), threadIdx.x)] thread_extent = 8&lt;/span&gt;
-  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class [...]
-    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span cla [...]
-      &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)& [...]
-      &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;) [...]
-      &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;) [...]
-      &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;) [...]
-      &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;di&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-          &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&g [...]
-          &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;& [...]
-          &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;& [...]
-          &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;& [...]
-        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;Without vthread (just set to 1), the IR is:&lt;/p&gt;
-
-&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cm&quot;&gt;/* Input = [1, 1, 32, 32], Filter = [1, 1, 3, 3], stride = [1, 1], padding = 'SAME' */&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;produce&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(blockIdx.y, , blockIdx.y)] thread_extent = 1&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(threadIdx.y, Range(min=0, extent=8), threadIdx.y)] thread_extent = 8&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(threadIdx.x, Range(min=0, extent=8), threadIdx.x)] thread_extent = 8&lt;/span&gt;
-  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class [...]
-    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span cla [...]
-      &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)& [...]
-      &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;di&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-          &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&g [...]
-        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;As we can see, when &lt;code class=&quot;highlighter-rouge&quot;&gt;num_vthread_y = 2&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;num_vthread_x = 2&lt;/code&gt;, the 32 x 32 channel is divided into four sub-channels of 16 x 16.
-Each thread computes four output elements at a time, one element in each sub-channel.&lt;/p&gt;
-
-&lt;p&gt;Below is the result with Filter = [256, 1, 3, 3], stride = [1, 1], blocking_h = 32, blocking_w = 32:&lt;/p&gt;
-
-&lt;style&gt;
-table th:nth-of-type(1) {
-    width: 120px;
-}
-table th:nth-of-type(2) {
-    width: 120px;
-}
-&lt;/style&gt;
-
-&lt;table&gt;
-  &lt;thead&gt;
-    &lt;tr&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Case&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Input&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;num_thread_y, num_thread_x&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;num_vthread_y, num_vthread_x&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;TVM SAME pad (us)&lt;/th&gt;
-    &lt;/tr&gt;
-  &lt;/thead&gt;
-  &lt;tbody&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;8, 8&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1, 1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;132.5&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;2&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;8, 8&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1, 4&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;103.1&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;3&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;4, 32&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1, 1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;95.9&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;4&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;8, 16&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1, 2&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;90.9&lt;/td&gt;
-    &lt;/tr&gt;
-  &lt;/tbody&gt;
-&lt;/table&gt;
-
-&lt;p&gt;Case 2 is faster than case 1 because in case 2, &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_x=8&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;num_vthread_x=4&lt;/code&gt; together ensure that consecutive threads access consecutive memory addresses,
-thus avoiding bank conflicts, as illustrated below (each color represents one thread’s workload):&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/depthconv_tutorial/vthread_and_strided_pattern.png&quot; alt=&quot;image&quot; width=&quot;90%&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;In theory, cases 3 and 4 should be equally fast, since they have the same workload per thread and both enjoy efficient shared memory access. In practice, case 4 is just a little faster.&lt;/p&gt;
-
-&lt;p&gt;Remember TensorFlow’s speed? It was 251.6us, and now TVM is 2.8x faster. Following the progression 387.4 -&amp;gt; 132.5 -&amp;gt; 95.9 -&amp;gt; 90.9: blocking helps the most; tuning thread numbers saves 37us;
-vthread saves an additional 5us.&lt;/p&gt;
-
-&lt;p&gt;In fact, TVM can be dramatically faster than TensorFlow with a large kernel size or channel_multiplier (because of more filter reuse):&lt;/p&gt;
-
-&lt;table&gt;
-  &lt;thead&gt;
-    &lt;tr&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Input&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Filter&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;stride&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;tf-1.2 SAME pad (us)&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;TVM SAME pad (us)&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;TVM speedup&lt;/th&gt;
-    &lt;/tr&gt;
-  &lt;/thead&gt;
-  &lt;tbody&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 1, 3, 3]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;251.6&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;90.9&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;2.8x&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 1, 5, 5]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;597.6&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;128.9&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;4.6x&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 2, 3, 3]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;659.9&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;143.7&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;4.6x&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 2, 5, 5]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1203.9&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;170.5&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;7.1x&lt;/td&gt;
-    &lt;/tr&gt;
-  &lt;/tbody&gt;
-&lt;/table&gt;
-
-&lt;h2 id=&quot;operator-fusion&quot;&gt;Operator Fusion&lt;/h2&gt;
-
-&lt;p&gt;One typical optimization in deep learning is operator fusion: computing multiple operators together in a single kernel without saving intermediate results back to global memory.
-TVM supports this out of the box.&lt;/p&gt;
-
-&lt;p&gt;Consider a common pattern in neural networks: &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d&lt;/code&gt; + &lt;code class=&quot;highlighter-rouge&quot;&gt;scale_shift&lt;/code&gt; + &lt;code class=&quot;highlighter-rouge&quot;&gt;relu&lt;/code&gt;. We can fuse the three operators into one, by slightly modifying the original schedule:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;topi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;depthwise_c [...]
-&lt;span class=&quot;n&quot;&gt;ScaleShift&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;topi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scale_shift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/s [...]
-&lt;span class=&quot;n&quot;&gt;Relu&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;topi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ScaleShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-
-&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Relu&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# is no longer DepthwiseConv2d
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ScaleShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compute_inline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# this line fuses ScaleShift, explicitly
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_scope&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;local&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&qu [...]
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;schedule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# schedule for Output the same way we schedule for DepthwiseConv2d as discussed above
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compute_at&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt; [...]
-&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;It generates IR like this:&lt;/p&gt;
-
-&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cm&quot;&gt;/* Input = [1, 1, 32, 32], Filter = [1, 1, 3, 3], stride = [1, 1], padding = 'SAME' */&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;produce&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Relu&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(blockIdx.y, , blockIdx.y)] thread_extent = 1&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [DepthwiseConv2d] storage_scope = &quot;local&quot;&lt;/span&gt;
-  &lt;span class=&quot;n&quot;&gt;allocate&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float32&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/s [...]
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(threadIdx.y, Range(min=0, extent=8), threadIdx.y)] thread_extent = 8&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(threadIdx.x, Range(min=0, extent=8), threadIdx.x)] thread_extent = 8&lt;/span&gt;
-  &lt;span class=&quot;n&quot;&gt;produce&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-      &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-        &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &l [...]
-        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;di&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-          &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-            &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt [...]
-          &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span clas [...]
-    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span cl [...]
-      &lt;span class=&quot;n&quot;&gt;Relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt [...]
-    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;As we can see, each thread computes &lt;code class=&quot;highlighter-rouge&quot;&gt;scale_shift&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;relu&lt;/code&gt; before writing the result of &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d&lt;/code&gt; to global memory. The fused operator is as fast as a single &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d&lt;/code&gt;.
-Below is the result with Input = [1, 256, 96, 96], Filter = [256, 1, 3, 3], stride = [1, 1], padding = ‘SAME’:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;tf-1.2 &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d&lt;/code&gt;: 251.6 us&lt;/li&gt;
-  &lt;li&gt;tf-1.2 &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d&lt;/code&gt; + &lt;code class=&quot;highlighter-rouge&quot;&gt;scale_shift&lt;/code&gt; + &lt;code class=&quot;highlighter-rouge&quot;&gt;relu&lt;/code&gt; (separate): 419.9 us&lt;/li&gt;
-  &lt;li&gt;TVM &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d&lt;/code&gt;: 90.9 us&lt;/li&gt;
-  &lt;li&gt;TVM &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d + scale_shift + relu&lt;/code&gt; (fused): 91.5 us&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;The advantage of operator fusion is obvious.&lt;/p&gt;
-
-&lt;p&gt;This is not the end; TVM can do operator fusion in an even smarter way. You may refer to &lt;a href=&quot;https://github.com/dmlc/tvm/issues/215&quot;&gt;this issue&lt;/a&gt; and read the source code provided below.&lt;/p&gt;
-
-&lt;h2 id=&quot;show-me-the-code&quot;&gt;Show me the code&lt;/h2&gt;
-&lt;ul&gt;
-  &lt;li&gt;Declare: &lt;a href=&quot;https://github.com/dmlc/tvm/blob/master/topi/python/topi/nn/depthwise_conv2d.py&quot;&gt;https://github.com/dmlc/tvm/blob/master/topi/python/topi/nn/depthwise_conv2d.py&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;Schedule: &lt;a href=&quot;https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/depthwise_conv2d.py&quot;&gt;https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/depthwise_conv2d.py&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;Test: &lt;a href=&quot;https://github.com/dmlc/tvm/blob/master/topi/recipe/conv/depthwise_conv2d_test.py&quot;&gt;https://github.com/dmlc/tvm/blob/master/topi/recipe/conv/depthwise_conv2d_test.py&lt;/a&gt;&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;
-&lt;p&gt;The author thanks Tianqi Chen for his helpful advice and inspiring discussion.&lt;/p&gt;
-
-&lt;h2 id=&quot;bio&quot;&gt;Bio&lt;/h2&gt;
-&lt;p&gt;&lt;a href=&quot;https://Huyuwei.github.io&quot;&gt;Yuwei Hu&lt;/a&gt; is an intern in &lt;a href=&quot;http://tusimple.ai/&quot;&gt;Tusimple&lt;/a&gt;’s HPC group.
-He is experiencing a gap year after obtaining a bachelor’s degree in electrical engineering from Beihang University.&lt;/p&gt;
-
-&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
-&lt;p&gt;[1] &lt;a href=&quot;https://arxiv.org/abs/1610.02357&quot;&gt;Xception: Deep Learning with Depthwise Separable Convolutions&lt;/a&gt;&lt;/p&gt;
-
-&lt;p&gt;[2] &lt;a href=&quot;https://arxiv.org/abs/1704.04861&quot;&gt;MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications&lt;/a&gt;&lt;/p&gt;
-
-&lt;p&gt;[3] &lt;a href=&quot;http://norvig.com/21-days.html#answers&quot;&gt;Approximate timing for various operations on a typical PC&lt;/a&gt;&lt;/p&gt;
-</content>
- </entry>
- 
  
 </feed>
diff --git a/blog.html b/blog.html
index eff9ea6..a0846fe 100644
--- a/blog.html
+++ b/blog.html
@@ -156,6 +156,16 @@
 
 <li>
   <span>
+    <a class="post-link" href="/2020/07/15/how-to-bring-your-own-codegen-to-tvm">How to Bring Your Own Codegen to TVM</a>
+  </span>
+  </br>
+  <span>
+    Jul 15, 2020
+  </span>
+</li>
+
+<li>
+  <span>
     <a class="post-link" href="/2020/07/14/bert-pytorch-tvm">Bridging PyTorch and TVM</a>
   </span>
   </br>
diff --git a/images/bring-your-own-codegen/after_annotation.png b/images/bring-your-own-codegen/after_annotation.png
new file mode 100644
index 0000000..746cc39
Binary files /dev/null and b/images/bring-your-own-codegen/after_annotation.png differ
diff --git a/images/bring-your-own-codegen/after_merging_regions.png b/images/bring-your-own-codegen/after_merging_regions.png
new file mode 100644
index 0000000..9e70cfd
Binary files /dev/null and b/images/bring-your-own-codegen/after_merging_regions.png differ
diff --git a/images/bring-your-own-codegen/after_partitioning.png b/images/bring-your-own-codegen/after_partitioning.png
new file mode 100644
index 0000000..3c59c50
Binary files /dev/null and b/images/bring-your-own-codegen/after_partitioning.png differ
diff --git a/images/bring-your-own-codegen/original_graph.png b/images/bring-your-own-codegen/original_graph.png
new file mode 100644
index 0000000..d37b72a
Binary files /dev/null and b/images/bring-your-own-codegen/original_graph.png differ
diff --git a/rss.xml b/rss.xml
index c9886c9..2fa0fda 100644
--- a/rss.xml
+++ b/rss.xml
@@ -5,12 +5,491 @@
         <description>TVM - </description>
         <link>https://tvm.apache.org</link>
         <atom:link href="https://tvm.apache.org" rel="self" type="application/rss+xml" />
-        <lastBuildDate>Tue, 14 Jul 2020 09:12:02 -0700</lastBuildDate>
-        <pubDate>Tue, 14 Jul 2020 09:12:02 -0700</pubDate>
+        <lastBuildDate>Wed, 15 Jul 2020 09:54:04 -0700</lastBuildDate>
+        <pubDate>Wed, 15 Jul 2020 09:54:04 -0700</pubDate>
         <ttl>60</ttl>
 
 
         <item>
+                <title>How to Bring Your Own Codegen to TVM</title>
+                <description>&lt;p&gt;To free data scientists from worrying about the performance when developing a new model, hardware backend providers (e.g., Intel, NVIDIA, ARM, etc) either provide kernel libraries such as cuBLAS or cuDNN with many commonly used deep learning kernels, or provide frameworks such as DNNL or TensorRT with a graph engine to let users describe their models in a certain way to achieve high performance. In addition, emerging deep learning accelerators also h [...]
+
+&lt;p&gt;However, users have to learn a new programming interface whenever they work with a new kernel library or device. As a result, a unified programming interface becomes increasingly important so that all users and hardware backend providers can stand on the same page.&lt;/p&gt;
+
+&lt;p&gt;To share the programming interface with widely used deep learning frameworks, many hardware device providers have attempted to integrate their device backends into TensorFlow. However, since TensorFlow does not provide an official interface for new backends, you have to hack TensorFlow for registration, which involves many source file changes and makes future maintenance difficult.&lt;/p&gt;
+
+&lt;p&gt;In this post, we demonstrate how you, as a hardware backend provider, can easily leverage the Bring Your Own Codegen (BYOC) framework to integrate the kernel library/compiler/framework of your hardware device to TVM. The most important advantage of leveraging BYOC framework is that &lt;strong&gt;&lt;em&gt;all related source files of your devices are self-contained, so the codegen/runtime of your devices are pluggable to the TVM code base.&lt;/em&gt;&lt;/strong&gt; It means that  [...]
+
+&lt;p&gt;In the rest of this post, we first illustrate a scenario in which you may need TVM with BYOC, followed by an overview of the BYOC compilation and runtime flows. Then, we illustrate step by step how to integrate a vendor library or an execution engine into TVM with BYOC, using Intel DNNL (a.k.a. MKL-DNN, OneDNN) as a running example.&lt;/p&gt;
+
+&lt;h2 id=&quot;bring-an-asic-accelerator-to-tvm&quot;&gt;Bring an ASIC Accelerator to TVM&lt;/h2&gt;
+
+&lt;p&gt;Let’s first sketch a scenario to illustrate why you might want to bring your accelerator to TVM and what features you can expect from the BYOC framework. If you are not sure whether your case is suitable for BYOC, you are welcome to raise a discussion at &lt;a href=&quot;https://discuss.tvm.ai&quot;&gt;discuss.tvm.ai&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;Imagine that you have just built an edge device platform with an ARM CPU and a fantastic accelerator that achieves amazing performance for common image classification models. In other words, your accelerator does well on Conv2D, ReLU, GEMM, and other widely used CNN operators.&lt;/p&gt;
+
+&lt;p&gt;Unfortunately, object detection models are getting more and more popular as well, and your customers need to run both image classification and object detection models on your platform. Although your accelerator is capable of executing almost all operators in object detection models, one operator (e.g., non-maximum suppression, NMS) is missing.&lt;/p&gt;
+
+&lt;h3 id=&quot;let-tvm-execute-unsupported-operators&quot;&gt;Let TVM execute unsupported operators&lt;/h3&gt;
+&lt;p&gt;Since TVM has multiple codegens for different backends, it is easy for the open source community to implement new operators on CPU or GPU in a short time. Ideally, if you integrate the compilation flow of your accelerator into TVM with BYOC, TVM will perform Relay graph partitioning to offload part of the graph to your accelerator while keeping the rest on TVM. As a result, you can claim that your platform is capable of running all models without worrying about new operators.&lt;/p&gt;
+
+&lt;h3 id=&quot;customize-graph-level-optimization&quot;&gt;Customize graph-level optimization&lt;/h3&gt;
+&lt;p&gt;Your ASIC accelerator must have its own compilation flow. Usually, it falls into one of the following two cases:&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Generate a graph representation and feed it to a graph engine&lt;/strong&gt;:
+You may have your own graph engine that is capable of executing a graph (or a neural network model) on your accelerator. For example, both Intel DNNL and NVIDIA TensorRT use an engine to run a whole graph or a model, so that they are able to 1) reduce memory transactions between operators and 2) optimize graph execution with operator fusion.&lt;/p&gt;
+
+&lt;p&gt;In order to achieve the above two optimizations, you may need to process the graph at compilation time. For example, Conv2D and bias addition are two separate operators in TVM, but they may be one operator (Conv2D with bias addition capability) on your accelerator. In this case, you may want to optimize the graph by replacing the &lt;code class=&quot;highlighter-rouge&quot;&gt;conv2d - add&lt;/code&gt; graph pattern with a &lt;code class=&quot;highlighter-rouge&quot;&gt;yo [...]
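+
+&lt;p&gt;On the TVM side, such a pattern can be described with the Relay dataflow pattern language and registered in a pattern table. The following is a minimal sketch modeled after the DNNL integration; the composite pattern names are illustrative:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from tvm.relay.dataflow_pattern import is_op, wildcard
+from tvm.relay.op.contrib.register import register_pattern_table
+
+def make_conv_pattern(with_bias=True):
+    # Match conv2d, optionally followed by bias add, then ReLU.
+    conv = is_op(&quot;nn.conv2d&quot;)(wildcard(), wildcard())
+    conv_out = is_op(&quot;add&quot;)(conv, wildcard()) if with_bias else conv
+    return is_op(&quot;nn.relu&quot;)(conv_out)
+
+@register_pattern_table(&quot;dnnl&quot;)
+def dnnl_pattern_table():
+    return [(&quot;dnnl.conv2d_bias_relu&quot;, make_conv_pattern(True)),
+            (&quot;dnnl.conv2d_relu&quot;, make_conv_pattern(False))]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;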
+
+&lt;p&gt;If your compilation flow falls into this case, then we recommend reading all the rest sections in this post but skipping &lt;a href=&quot;#bring-dnnl-to-tvm-c-source-codegen&quot;&gt;Bring DNNL to TVM: C Source Codegen&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Generate assembly code and compile it to an executable binary&lt;/strong&gt;:
+If you do not have an end-to-end execution framework for your platform as in the previous case, you may instead have a compiler that compiles programs in the assembly code of your ISA. In order to feed the assembly code to your compiler, you will need a codegen to generate and optimize the assembly code from a Relay graph.&lt;/p&gt;
+
+&lt;p&gt;If your compilation flow falls into this case, then we recommend reading all the rest sections in this post but skipping &lt;a href=&quot;#bring-dnnl-to-tvm-json-codegenruntime&quot;&gt;Bring DNNL to TVM: JSON Codegen/Runtime&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h2 id=&quot;how-byoc-works&quot;&gt;How BYOC Works&lt;/h2&gt;
+
+&lt;p&gt;We then briefly explain how the BYOC framework works. For more detailed explanations of the underlying framework components and their implementations, please refer to the &lt;a href=&quot;https://tvm.apache.org/docs/dev/relay_bring_your_own_codegen.html&quot;&gt;developer document&lt;/a&gt;. In short, given the Relay graph in Figure 1, the BYOC framework performs the following steps:&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/bring-your-own-codegen/original_graph.png&quot; alt=&quot;The original Relay graph&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt;
+Figure 1: The Original Relay Graph.
+&lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;1-graph-annotation&quot;&gt;1. Graph Annotation&lt;/h3&gt;
+&lt;p&gt;Taking a user-provided Relay graph, our first step is to annotate the nodes in the graph that can potentially be offloaded to your accelerator. You will need to follow &lt;a href=&quot;#bring-dnnl-to-tvm-annotation-rules&quot;&gt;Bring DNNL to TVM: Annotation Rules&lt;/a&gt; to implement a whitelist of supported operators, or a list of customized composite operator patterns. An example annotation result is shown in Figure 2.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/bring-your-own-codegen/after_annotation.png&quot; alt=&quot;The Graph with Annotations&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt;
+Figure 2: The Graph with Annotations.
+&lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;2-graph-transformation&quot;&gt;2. Graph Transformation&lt;/h3&gt;
+&lt;p&gt;The second step is to transform and optimize the graph based on the annotations. Specifically, BYOC performs the following transformations.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;2.1: Merge compiler region&lt;/strong&gt;: As can be seen in Figure 2, we now have many “regions” in the graph that can be offloaded to your accelerator, but some of them can actually be merged to reduce data transfer and kernel launching overhead. Accordingly, step 2.1 uses a greedy algorithm to merge as many of those regions as possible while guaranteeing functional correctness. The result is depicted in Figure 3.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/bring-your-own-codegen/after_merging_regions.png&quot; alt=&quot;After Merging Compiler Regions&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt;
+Figure 3: After Merging Compiler Regions.
+&lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;2.2: Partition Graph&lt;/strong&gt;: For each region from the previous step, we create a Relay function with an attribute &lt;code class=&quot;highlighter-rouge&quot;&gt;Compiler&lt;/code&gt; to indicate that this Relay function should be entirely offloaded to your accelerator, as shown in Figure 4.&lt;/p&gt;
+
+&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/bring-your-own-codegen/after_partitioning.png&quot; alt=&quot;After Graph Partitioning&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
+&lt;center&gt;
+Figure 4: After Graph Partitioning.
+&lt;/center&gt;
+&lt;p&gt;&lt;/p&gt;
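+
+&lt;p&gt;Steps 1 and 2 together correspond to a short sequence of Relay passes. As a minimal sketch, assuming your codegen is registered under the name &lt;code class=&quot;highlighter-rouge&quot;&gt;dnnl&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;mod&lt;/code&gt; is an existing Relay module:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from tvm import relay
+
+# Annotate supported operators, merge adjacent regions, and partition the
+# graph so that annotated regions are offloaded while the rest stays on TVM.
+mod = relay.transform.AnnotateTarget(&quot;dnnl&quot;)(mod)
+mod = relay.transform.MergeCompilerRegions()(mod)
+mod = relay.transform.PartitionGraph()(mod)
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;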
+
+&lt;h3 id=&quot;3-code-generation&quot;&gt;3. Code Generation&lt;/h3&gt;
+&lt;p&gt;Now we know which part of the Relay graph should be offloaded. In this step, we sequentially send every Relay function with &lt;code class=&quot;highlighter-rouge&quot;&gt;Compiler=your_accelerator&lt;/code&gt; to your codegen. Your codegen should compile the Relay function into a form that matches your own compilation flow. It can be either C source code or any text format.&lt;/p&gt;
+
+&lt;p&gt;Finally, all compiled functions will be serialized along with other non-offloaded Relay functions to a single &lt;code class=&quot;highlighter-rouge&quot;&gt;.so&lt;/code&gt; file by the TVM &lt;code class=&quot;highlighter-rouge&quot;&gt;export_library&lt;/code&gt; Python API. In other words, the user will get only one &lt;code class=&quot;highlighter-rouge&quot;&gt;.so&lt;/code&gt; file after running this flow.&lt;/p&gt;
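+
+&lt;p&gt;As a minimal sketch of this step (the file name is arbitrary and &lt;code class=&quot;highlighter-rouge&quot;&gt;mod&lt;/code&gt; is a partitioned Relay module):&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import tvm
+from tvm import relay
+
+# Compile the partitioned module; offloaded functions go through your codegen.
+with tvm.transform.PassContext(opt_level=3):
+    lib = relay.build(mod, target=&quot;llvm&quot;)
+
+# Everything is serialized into a single shared library.
+lib.export_library(&quot;compiled_model.so&quot;)
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;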
+
+&lt;h3 id=&quot;4-runtime&quot;&gt;4. Runtime&lt;/h3&gt;
+&lt;p&gt;You may also need to implement a runtime to initialize your graph engine (if applicable) and execute the compiled functions. During inference, the TVM runtime (i.e., graph runtime or VM) will leverage your runtime to invoke the offloaded functions whenever it encounters the corresponding function call in Figure 4. Your runtime is responsible for launching the compiled function with the given input tensor arrays and filling the results into the output tensor arrays.&lt;/p&gt;
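+
+&lt;p&gt;From the user’s perspective, deployment then looks like any other TVM module. A minimal sketch, where the library path, input name, and shape are hypothetical:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np
+import tvm
+from tvm.contrib import graph_runtime
+
+lib = tvm.runtime.load_module(&quot;compiled_model.so&quot;)
+gmod = graph_runtime.GraphModule(lib[&quot;default&quot;](tvm.cpu()))
+gmod.set_input(&quot;data&quot;, np.random.uniform(size=(1, 3, 224, 224)).astype(&quot;float32&quot;))
+gmod.run()  # your runtime is invoked for the offloaded functions
+out = gmod.get_output(0)
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;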
+
+&lt;p&gt;In the rest of this post, we use DNNL as an example to demonstrate how to achieve the above workflow using the BYOC framework. Please note that all code and line numbers referred to in this post are based on the TVM repository’s master branch commit &lt;a href=&quot;https://github.com/apache/incubator-tvm/tree/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8&quot;&gt;8a0249c&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h2 id=&quot;bring-dnnl-to-tvm-annotation-rules&quot;&gt;Bring DNNL to TVM: Annotation Rules&lt;/h2&gt;
+
+&lt;p&gt;The BYOC framework provides two approaches for you to describe the supported operators and patterns. You can use both of them simultaneously. In this section, we use DNNL as an example to show how to make use of them. The complete implementation is available &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/python/tvm/relay/op/contrib/dnnl.py&quot;&gt;here&lt;/a&gt;. Note that we put the annotation rules for your codegen under [...]
+
+&lt;h3 id=&quot;rules-for-single-operators&quot;&gt;Rules for single operators&lt;/h3&gt;
+&lt;p&gt;You can intuitively specify which Relay operators are supported by your accelerator with the BYOC API. For example, we use the following code snippet to build a rule saying that our DNNL codegen supports Conv2D:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;register_op_attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt [...]
+&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_dnnl_conv2d_wrapper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;attrs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;This registers a new attribute &lt;code class=&quot;highlighter-rouge&quot;&gt;target.dnnl&lt;/code&gt; to the Relay &lt;code class=&quot;highlighter-rouge&quot;&gt;nn.conv2d&lt;/code&gt; operator. In this way, the BYOC annotation pass can invoke &lt;code class=&quot;highlighter-rouge&quot;&gt;target.dnnl()&lt;/code&gt; for every operator in the graph to check whether it is supported by the DNNL codegen.&lt;/p&gt;
+
+&lt;p&gt;On the other hand, it might be tedious to write the above code snippet for every single operator. For the DNNL implementation, we implemented a helper function, &lt;code class=&quot;highlighter-rouge&quot;&gt;_register_external_op_helper&lt;/code&gt;, to make our life easier:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;supported&lt;/span&gt;&lt;span class=&q [...]
+    &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;register_op_attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;target.dnnl [...]
+    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_func_wrapper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;attrs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
+        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;supported&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_func_wrapper&lt;/span&gt;
+
+&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;nn.batch_norm&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;nn.conv2d&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;nn.dense&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;nn.relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;add&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;subtract&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;multiply&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;In the above example, we specify a list of operators that can be supported by the DNNL codegen.&lt;/p&gt;
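+
+&lt;p&gt;Note that the rule function receives the operator attributes and arguments, so the decision can be fine-grained. For example, here is a hypothetical rule (not part of the actual DNNL integration) that offloads max pooling only for float32 inputs:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;@tvm.ir.register_op_attr(&quot;nn.max_pool2d&quot;, &quot;target.dnnl&quot;)
+def _dnnl_max_pool2d(attrs, args):
+    # Hypothetical rule: only offload float32 pooling to DNNL.
+    return args[0].checked_type.dtype == &quot;float32&quot;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;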
+
+&lt;h3 id=&quot;rules-for-graph-patterns&quot;&gt;Rules for graph patterns&lt;/h3&gt;
+&lt;p&gt;Your accelerator or compiler may have optimized some patterns (e.g., Conv2D + add + ReLU) to be a single instruction or an API call. In this case, you can specify a mapping from a graph pattern to your instruction/API. In the case of DNNL, its Conv2D API already includes bias addition and allows the next ReLU to be attached, so we can call DNNL as in the following code snippet (the complete implementation can be found &lt;a href=&quot;https://github.com/apache/incubator-tvm/blo [...]
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;DNNLConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;has_bias&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt [...]
+  &lt;span class=&quot;c1&quot;&gt;// ... skip ...&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv_desc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;convolution_forward&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;desc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prop_kind&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;forward_inference&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;algorithm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;convolution_direct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;conv_src_md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv_weights_md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv_bias_md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv_dst_md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;strides_dims&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;padding_dims_l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;padding_dims_r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+
+  &lt;span class=&quot;c1&quot;&gt;// Attach ReLU&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;primitive_attr&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;has_relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;post_ops&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append_eltwise&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;spa [...]
+    &lt;span class=&quot;n&quot;&gt;attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_post_ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+
+  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv2d_prim_desc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;convolution_forward&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;primitive_desc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;conv_desc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;engine_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+  &lt;span class=&quot;c1&quot;&gt;// ... skip ...&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;In this case, besides a single &lt;code class=&quot;highlighter-rouge&quot;&gt;conv2d&lt;/code&gt;, we would like to map the graph pattern &lt;code class=&quot;highlighter-rouge&quot;&gt;conv2d+relu&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;DNNLConv2d(false, true)&lt;/code&gt;, and map &lt;code class=&quot;highlighter-rouge&quot;&gt;conv2d+add+relu&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;DNNLConv2d(true, true)&lt;/code&gt;. We can [...]
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;make_pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&l [...]
+  &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'nn.conv2d'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;conv_out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'add'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;conv_out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'nn.relu'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv_out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+
+&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;register_pattern_table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dnnl&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;pattern_table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;conv2d_bias_relu_pat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dnnl.conv2d_bias_relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;make_pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt [...]
+  &lt;span class=&quot;n&quot;&gt;conv2d_relu_pat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dnnl.conv2d_relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;make_pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span clas [...]
+  &lt;span class=&quot;n&quot;&gt;dnnl_patterns&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv2d_bias_relu_pat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv2d_relu_pat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_patterns&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;In the DNNL example, we implemented two patterns with different names so that we can easily recognize them in the codegen. Note that the patterns are implemented in the Relay pattern language. You can follow &lt;a href=&quot;https://tvm.apache.org/docs/langref/relay_pattern.html&quot;&gt;this tutorial&lt;/a&gt; to learn how to write your own patterns.&lt;/p&gt;
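+
+&lt;p&gt;As a quick sanity check (an illustrative sketch, assuming &lt;code class=&quot;highlighter-rouge&quot;&gt;wildcard&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;is_op&lt;/code&gt; are imported from &lt;code class=&quot;highlighter-rouge&quot;&gt;tvm.relay.dataflow_pattern&lt;/code&gt;), you can match a pattern against a hand-built graph before registering it:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from tvm import relay
+
+data = relay.var(&quot;data&quot;, shape=(1, 3, 224, 224))
+weight = relay.var(&quot;weight&quot;, shape=(16, 3, 3, 3))
+out = relay.nn.relu(relay.nn.conv2d(data, weight))
+
+# make_pattern is the helper defined above.
+assert make_pattern(with_bias=False).match(out)
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;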
+
+&lt;p&gt;With the pattern table, we can then use a Relay pass to perform the transformation from&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;%1 = nn.conv2d(%data, %weight, ...)
+%2 = add(%1, %bias)
+%3 = nn.relu(%2)
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;to&lt;/p&gt;
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;%1 = fn(%input1, %input2, %input3,
+        Composite=&quot;dnnl.conv2d_bias_relu&quot;,
+        PartitionedFromPattern=&quot;nn.conv2d_add_nn.relu_&quot;) {
+  %1 = nn.conv2d(%input1, %input2, ...)
+  %2 = add(%1, %input3)
+  nn.relu(%2)
+}
+%2 = %1(%data, %weight, %bias)
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;Thus, the DNNL codegen can get the pattern name &lt;code class=&quot;highlighter-rouge&quot;&gt;conv2d_bias_relu&lt;/code&gt; and map &lt;code class=&quot;highlighter-rouge&quot;&gt;%1&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;DNNLConv2d(true, true)&lt;/code&gt;.&lt;/p&gt;
+
+&lt;p&gt;As you may have noticed, we also have an attribute called “PartitionedFromPattern” in the composite function. This can be helpful if your pattern contains &lt;code class=&quot;highlighter-rouge&quot;&gt;wildcard&lt;/code&gt; operators. For example, we may have a pattern table &lt;code class=&quot;highlighter-rouge&quot;&gt;(&quot;conv2d_with_something&quot;, conv2d -&amp;gt; *)&lt;/code&gt;:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;make_pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&l [...]
+  &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'nn.conv2d'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;In this case, you will get a composite function with &lt;code class=&quot;highlighter-rouge&quot;&gt;Composite=conv2d_with_something&lt;/code&gt;, but you have no idea what graph it actually matched. That’s where PartitionedFromPattern comes into play. You can tell whether the matched graph is &lt;code class=&quot;highlighter-rouge&quot;&gt;conv2d -&amp;gt; add&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;conv2d -&amp;gt; relu&lt;/code&gt; by looking at  [...]
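+
+&lt;p&gt;For example, a codegen could branch on the matched structure as in the following sketch, assuming &lt;code class=&quot;highlighter-rouge&quot;&gt;func&lt;/code&gt; is the composite function it received:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;composite = func.attrs[&quot;Composite&quot;]              # e.g., &quot;conv2d_with_something&quot;
+matched = func.attrs[&quot;PartitionedFromPattern&quot;]   # e.g., &quot;nn.conv2d_add_&quot;
+if matched == &quot;nn.conv2d_add_&quot;:
+    pass  # the wildcard matched an add
+elif matched == &quot;nn.conv2d_nn.relu_&quot;:
+    pass  # the wildcard matched a relu
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;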
+
+&lt;h2 id=&quot;bring-dnnl-to-tvm-relay-graph-transformation&quot;&gt;Bring DNNL to TVM: Relay Graph Transformation&lt;/h2&gt;
+&lt;p&gt;With the annotation rules from the previous step, we can now apply a list of BYOC Relay passes to transform the Relay graph from Figure 1 to Figure 4:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;create_relay_module_from_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Output: Figure 1
+&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MergeComposite&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pattern_table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&qu [...]
+&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnnotateTarget&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dnnl&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;) [...]
+&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MergeCompilerRegions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Output: Figure 3
+&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PartitionGraph&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Output: Figure 4
+&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;As can be seen, each Relay pass can be mapped to a step we have introduced in &lt;a href=&quot;#how-byoc-works&quot;&gt;How BYOC Works&lt;/a&gt;.&lt;/p&gt;
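+
+&lt;p&gt;Equivalently, the passes can be composed with a &lt;code class=&quot;highlighter-rouge&quot;&gt;Sequential&lt;/code&gt; so they always run in the required order (a sketch of the same flow as above):&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;seq = tvm.transform.Sequential([
+    transform.MergeComposite(pattern_table),
+    transform.AnnotateTarget([&quot;dnnl&quot;]),
+    transform.MergeCompilerRegions(),
+    transform.PartitionGraph(),
+])
+mod = seq(mod)
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;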
+
+&lt;h2 id=&quot;bring-dnnl-to-tvm-json-codegenruntime&quot;&gt;Bring DNNL to TVM: JSON Codegen/Runtime&lt;/h2&gt;
+&lt;p&gt;Now let’s implement the DNNL codegen that serializes a Relay graph to a JSON representation, and then implement the DNNL JSON runtime to deserialize and execute the graph. &lt;em&gt;Note that if you attempt to implement a codegen to generate C-compatible programs, you may want to directly proceed to the next section.&lt;/em&gt;&lt;/p&gt;
+
+&lt;p&gt;To enable DNNL JSON codegen/runtime in TVM to work on this example, please make sure DNNL is available on your machine, and build TVM with &lt;code class=&quot;highlighter-rouge&quot;&gt;set(USE_DNNL_CODEGEN ON)&lt;/code&gt; in &lt;code class=&quot;highlighter-rouge&quot;&gt;config.cmake&lt;/code&gt;.&lt;/p&gt;
+
+&lt;p&gt;The DNNL codegen is implemented in &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;src/relay/backend/contrib/dnnl/codegen.cc&lt;/code&gt;&lt;/a&gt;. Since we implemented DNNL codegen in both forms in this file for illustration purposes, you can focus on the part covered by &lt;code class=&quot;highlighter-rouge&quot;&gt;USE_JS [...]
+
+&lt;p&gt;We first register the codegen with the TVM registration API (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L510&quot;&gt;L510&lt;/a&gt;). This registration makes the TVM compile engine dispatch the Relay function with &lt;code class=&quot;highlighter-rouge&quot;&gt;Compiler=&amp;lt;your codegen&amp;gt;&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;relay.ext.&amp;lt;your  [...]
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;DNNLCompiler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Object [...]
+  &lt;span class=&quot;c1&quot;&gt;// &quot;ref&quot; should be the partitioned Relay function with kCompiler=dnnl.&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;CHECK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IsInstance&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FunctionNode&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Downcast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Function&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt [...]
+
+  &lt;span class=&quot;c1&quot;&gt;// Get the function name as the symbol to match in runtime.&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;func_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GetExtSymbol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;func&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+
+  &lt;span class=&quot;c1&quot;&gt;// Serialize the function to a JSON string (introduced later).&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;DNNLJSONSerializer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;serializer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;func_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;func&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;serializer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;serialize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graph_json&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;serializer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetJSON&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
+
+  &lt;span class=&quot;c1&quot;&gt;// The constant tensor names that have been bound to the module.&lt;/span&gt;
+  &lt;span class=&quot;c1&quot;&gt;// All constant tensors will be serialized along with the JSON graph&lt;/span&gt;
+  &lt;span class=&quot;c1&quot;&gt;// when export_library is invoked.&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;serializer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetParams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
+
+  &lt;span class=&quot;c1&quot;&gt;// The function to create the DNNL JSON runtime (introduced later).&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Registry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt;& [...]
+  &lt;span class=&quot;n&quot;&gt;CHECK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nullptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Cannot find JSON runtime module to create&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+
+  &lt;span class=&quot;c1&quot;&gt;// Create a DNNL runtime module that can run the serialized function.&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;func_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graph_json&lt;/span&gt;&l [...]
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;TVM_REGISTER_GLOBAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;relay.ext.dnnl&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_body_typed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLCompiler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Note that &lt;strong&gt;&lt;em&gt;each runtime module is only responsible for one Relay function, meaning that you may have several DNNL runtime modules in a single &lt;code class=&quot;highlighter-rouge&quot;&gt;.so&lt;/code&gt; file.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
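+
+&lt;p&gt;With the codegen registered, the normal build flow dispatches every partitioned function to it automatically. A usage sketch (the exact return value of &lt;code class=&quot;highlighter-rouge&quot;&gt;relay.build&lt;/code&gt; depends on your TVM version):&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# &quot;mod&quot; is the partitioned module from the previous section.
+with tvm.transform.PassContext(opt_level=3):
+    lib = relay.build(mod, target=&quot;llvm&quot;)
+# All DNNL runtime modules and their constants are bundled into one .so.
+lib.export_library(&quot;compiled_with_dnnl.so&quot;)
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;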
+
+&lt;h3 id=&quot;dnnl-json-serialization&quot;&gt;DNNL JSON Serialization&lt;/h3&gt;
+&lt;p&gt;Next, we implement the DNNL JSON serializer (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L429&quot;&gt;L429&lt;/a&gt;). We derived it from the BYOC JSON codegen (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/codegen_json/codegen_json.h&quot;&gt;src/relay/backend/contrib/codegen_json/codegen_json.h&lt;/ [...]
+
+&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;op:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;kernel&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;name:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;dnnl.conv2d_relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;inputs:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;s [...]
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;attrs:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+    &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;PartitionedFromPattern:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;nn.conv2d_nn.relu_&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+    &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;shape:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt [...]
+  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+&lt;p&gt;The problem is that we still need the Conv2D attributes such as padding and strides in runtime, but the BYOC JSON serializer only attaches the attributes of the composite function instead of the body operators. On the other hand, the customized DNNL JSON serializer attaches the attributes of the first and only Conv2D in the composite function to generate the following JSON node:&lt;/p&gt;
+
+&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;op:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;kernel&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;name:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;dnnl.conv2d_relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;inputs:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;s [...]
+  &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;attrs:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+    &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;shape:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt [...]
+    &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;data_layout:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;NCHW&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+    &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;kernel_layout:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;OIHW&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+    &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;strides:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+    &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;padding:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt [...]
+  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
+&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;As can be seen from the DNNL JSON serializer, you can customize the serializer to generate any form of JSON you like as long as your JSON runtime can interpret it.&lt;/p&gt;
+
+&lt;h3 id=&quot;dnnl-json-runtime&quot;&gt;DNNL JSON Runtime&lt;/h3&gt;
+
+&lt;p&gt;We then implement a DNNL JSON runtime to interpret and execute the serialized JSON graph. We put it under &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl_json_runtime.cc&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;src/runtime/contrib/dnnl/dnnl_json_runtime.cc&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;Again, we first register two APIs to create the runtime so that we can use them anywhere. The &lt;code class=&quot;highlighter-rouge&quot;&gt;runtime.DNNLJSONRuntimeCreate&lt;/code&gt; is used in the previous part after serialization, and &lt;code class=&quot;highlighter-rouge&quot;&gt;runtime.module.loadbinary_dnnl_json&lt;/code&gt; is used when loading the &lt;code class=&quot;highlighter-rouge&quot;&gt;.so&lt;/code&gt; back.&lt;/p&gt;
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Create a DNNL JSON runtime to interpret and execute the given JSON graph.&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;DNNLJSONRuntimeCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot; [...]
+                                      &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;const_names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;make_object&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLJSONRuntime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;symbol_name&lt;/span&gt;&lt;span class=&quot;p [...]
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;TVM_REGISTER_GLOBAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;runtime.DNNLJSONRuntimeCreate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_body_typed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLJSONRuntimeCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+
+&lt;span class=&quot;n&quot;&gt;TVM_REGISTER_GLOBAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;runtime.module.loadbinary_dnnl_json&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_body_typed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;JSONRuntimeBase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LoadFromBinary&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLJSONRuntime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;s [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Now we explain the DNNL JSON runtime implementation. The basic class structure is:&lt;/p&gt;
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DNNLJSONRuntime&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSONRuntimeBase&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt;  &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;type_key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;  &lt;span class=&quot;s&quot;&gt;&quot;dnnl_json&quot;&lt;/span&gt;&lt;span class=&quot;p& [...]
+  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Init&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NDArray&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;consts&lt;/span&gt;&lt;span class=&quot;p&q [...]
+    &lt;span class=&quot;c1&quot;&gt;// Initialize the DNNL graph engine.&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;BuildEngine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
+    
+    &lt;span class=&quot;c1&quot;&gt;// Setup constants entries for weights.&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;CHECK_EQ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;consts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;const_idx_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
+      &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;The number of input constants must match the number of required.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;SetupConstants&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;consts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+
+  &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
+   &lt;span class=&quot;c1&quot;&gt;// 1. Fill in the input buffers.&lt;/span&gt;
+   &lt;span class=&quot;c1&quot;&gt;// 2. Invoke the engine by interpreting the stream.&lt;/span&gt;
+   &lt;span class=&quot;c1&quot;&gt;// 3. Read and fill output buffers.&lt;/span&gt;
+  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;Init&lt;/code&gt; function is in charge of building the DNNL engine by interpreting the JSON graph string (see &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl_json_runtime.cc#L93&quot;&gt;L93&lt;/a&gt; for &lt;code class=&quot;highlighter-rouge&quot;&gt;BuildEngine&lt;/code&gt;), and filling the constant weights into the corresponding data entry  [...]
+
+&lt;p&gt;Next, the &lt;code class=&quot;highlighter-rouge&quot;&gt;Run&lt;/code&gt; function (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl_json_runtime.cc#L64&quot;&gt;L64&lt;/a&gt;) first writes the input tensors, which may come from user inputs or constant weights, to the corresponding DNNL memory buffers we initialized when building the DNNL engine. It then launches the DNNL engine to execute the JSON g [...]
+
+&lt;p&gt;Since the rest of the DNNL JSON runtime implementation is too DNNL-specific to dive into in this post, we will stop here. We would like to emphasize that while the DNNL JSON runtime is a good reference to start with, your JSON runtime can be fully customized to fit your requirements.&lt;/p&gt;
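+
+&lt;p&gt;To close the loop, here is a hedged usage sketch of loading the exported library back and running it. Loading triggers &lt;code class=&quot;highlighter-rouge&quot;&gt;runtime.module.loadbinary_dnnl_json&lt;/code&gt; under the hood to restore the DNNL JSON runtime modules (the graph runtime API and module key may differ across TVM versions):&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np
+import tvm
+from tvm.contrib import graph_runtime
+
+lib = tvm.runtime.load_module(&quot;compiled_with_dnnl.so&quot;)
+gmod = graph_runtime.GraphModule(lib[&quot;default&quot;](tvm.cpu()))
+gmod.set_input(&quot;data&quot;, np.random.uniform(size=(1, 3, 224, 224)).astype(&quot;float32&quot;))
+gmod.run()
+out = gmod.get_output(0)
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;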
+
+&lt;h2 id=&quot;bring-dnnl-to-tvm-c-source-codegen&quot;&gt;Bring DNNL to TVM: C Source Codegen&lt;/h2&gt;
+&lt;p&gt;Now let’s implement the DNNL codegen that generates C source code which invokes DNNL APIs to execute the Relay graph. &lt;em&gt;Note that if you attempt to implement a codegen to generate other graph representations, such as JSON format, you may want to read &lt;a href=&quot;#bring-dnnl-to-tvm-json-codegenruntime&quot;&gt;Bring DNNL to TVM: JSON Codegen/Runtime&lt;/a&gt; and skip this section.&lt;/em&gt;&lt;/p&gt;
+
+&lt;p&gt;To enable DNNL C source codegen in TVM to work on this example, please make sure DNNL is available on your machine, and build TVM with &lt;code class=&quot;highlighter-rouge&quot;&gt;set(USE_DNNL_CODEGEN C_SRC)&lt;/code&gt; in &lt;code class=&quot;highlighter-rouge&quot;&gt;config.cmake&lt;/code&gt;.&lt;/p&gt;
+
+&lt;p&gt;The DNNL codegen is implemented in &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;src/relay/backend/contrib/dnnl/codegen.cc&lt;/code&gt;&lt;/a&gt;. Since we implemented DNNL codegen in both forms in this file for illustration purposes, you can focus on the part &lt;strong&gt;NOT&lt;/strong&gt; covered by &lt;code class=&quot; [...]
+
+&lt;p&gt;We first register the codegen with the TVM registration API (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L510&quot;&gt;L510&lt;/a&gt;). This registration makes the TVM compile engine dispatch the Relay function with &lt;code class=&quot;highlighter-rouge&quot;&gt;Compiler=&amp;lt;your codegen&amp;gt;&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;relay.ext.&amp;lt;your  [...]
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;DNNLCompiler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Object [...]
+  &lt;span class=&quot;n&quot;&gt;DNNLModuleCodegen&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CreateCSourceModule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;TVM_REGISTER_GLOBAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;relay.ext.dnnl&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_body_typed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLCompiler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Note that &lt;strong&gt;&lt;em&gt;each runtime module is only responsible for one Relay function, meaning that you may have several DNNL runtime modules in a single &lt;code class=&quot;highlighter-rouge&quot;&gt;.so&lt;/code&gt; file.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;Then, we derive &lt;code class=&quot;highlighter-rouge&quot;&gt;CSourceModuleCodegenBase&lt;/code&gt; to implement  &lt;code class=&quot;highlighter-rouge&quot;&gt;DNNLModuleCodegen&lt;/code&gt; in &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L362&quot;&gt;L362&lt;/a&gt;. While &lt;code class=&quot;highlighter-rouge&quot;&gt;CSourceModuleCodegenBase&lt;/code&gt; is in charge of ot [...]
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CreateCSourceModule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt; [...]
+    &lt;span class=&quot;c1&quot;&gt;// Include headers&lt;/span&gt;
+    &lt;span class=&quot;c1&quot;&gt;// ...skip...&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;code_stream_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;#include &amp;lt;dnnl/dnnl_kernel.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+    &lt;span class=&quot;c1&quot;&gt;// ...skip...&lt;/span&gt;
+
+    &lt;span class=&quot;c1&quot;&gt;// &quot;ref&quot; should be the partitioned Relay function with kCompiler=dnnl.&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;CHECK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IsInstance&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FunctionNode&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GenDNNLFunc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Downcast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Function&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot; [...]
+
+    &lt;span class=&quot;c1&quot;&gt;// &quot;code&quot; is the generated C code with DNNL APIs.&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;code&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;code_stream_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
+
+    &lt;span class=&quot;c1&quot;&gt;// &quot;res&quot; is a tuple of constant weights (symbols, values).&lt;/span&gt;
+    &lt;span class=&quot;c1&quot;&gt;// All constant tensors will be serialized along with the generated C code&lt;/span&gt;
+    &lt;span class=&quot;c1&quot;&gt;// when export_library is invoked.&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sym&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&g [...]
+    &lt;span class=&quot;n&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;variables&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&am [...]
+
+    &lt;span class=&quot;c1&quot;&gt;// Create a CSource module with all above artifacts.&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Registry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt [...]
+    &lt;span class=&quot;n&quot;&gt;CHECK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nullptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Cannot find csource module to create the external runtime module&quot;&lt;/span&gt;&lt;span class [...]
+    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;code&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;c&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sym&lt;/span&gt;& [...]
+  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Next, we implement &lt;code class=&quot;highlighter-rouge&quot;&gt;GenDNNLFunc&lt;/code&gt; (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L365&quot;&gt;L365&lt;/a&gt;) to generate compilable C code with DNNL APIs as follows. Please see the embedded comments for explanations of the function interfaces compatible with the TVM C source runtime module.&lt;/p&gt;
+
+&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// The example Relay graph: conv2d -&amp;gt; add -&amp;gt; relu.&lt;/span&gt;
+&lt;span class=&quot;cp&quot;&gt;#include &amp;lt;cstdint&amp;gt;
+#include &amp;lt;cstdlib&amp;gt;
+#include &amp;lt;cstring&amp;gt;
+#include &amp;lt;vector&amp;gt;
+#include &amp;lt;tvm/runtime/c_runtime_api.h&amp;gt;
+#include &amp;lt;tvm/runtime/container.h&amp;gt;
+#include &amp;lt;tvm/runtime/packed_func.h&amp;gt;
+#include &amp;lt;dlpack/dlpack.h&amp;gt;
+#include &amp;lt;dnnl/dnnl_kernel.h&amp;gt;
+&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;namespace&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;namespace&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;contrib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+
+&lt;span class=&quot;c1&quot;&gt;// Execute the conv2d-&amp;gt;add-&amp;gt;relu graph with DNNL.&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;dnnl_0_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_i0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt [...]
+                        &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_i2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
+  &lt;span class=&quot;c1&quot;&gt;// Allocate intermediate buffers.&lt;/span&gt;
+  &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span c [...]
+  &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span c [...]
+  &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span c [...]
+
+  &lt;span class=&quot;c1&quot;&gt;// Pre-implemented op-based DNNL functions.&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;dnnl_conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dnnl_0_i0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_i1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span [...]
+  &lt;span class=&quot;n&quot;&gt;dnnl_add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_i2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &l [...]
+  &lt;span class=&quot;n&quot;&gt;dnnl_relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;spa [...]
+
+  &lt;span class=&quot;c1&quot;&gt;// Copy the final output to the corresponding buffer.&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memcpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span c [...]
+  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+
+&lt;span class=&quot;c1&quot;&gt;// The wrapper function with all arguments in DLTensor type.&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;dnnl_0_wrapper_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DLTensor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+        &lt;span class=&quot;n&quot;&gt;DLTensor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+        &lt;span class=&quot;n&quot;&gt;DLTensor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+        &lt;span class=&quot;n&quot;&gt;DLTensor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
+
+  &lt;span class=&quot;c1&quot;&gt;// Cast all DLTensor to primitive type buffers and invoke the above&lt;/span&gt;
+  &lt;span class=&quot;c1&quot;&gt;// execution function.&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;dnnl_0_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;& [...]
+  &lt;span class=&quot;n&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
+  &lt;span class=&quot;n&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
+  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
+
+&lt;span class=&quot;c1&quot;&gt;// The TVM macro to generate TVM runtime compatible function &quot;dnnl_0&quot;&lt;/span&gt;
+&lt;span class=&quot;c1&quot;&gt;// from our generated &quot;dnnl_0_wrapper_&quot;.&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;TVM_DLL_EXPORT_TYPED_FUNC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dnnl_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_wrapper_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;Note that the pre-implemented op-based DNNL functions are in &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl.cc&quot;&gt;src/runtime/contrib/dnnl/dnnl.cc&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;Since the rest of the implementation in &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;src/relay/backend/contrib/dnnl/codegen.cc&lt;/code&gt;&lt;/a&gt; is too DNNL-specific to dive into in this post, we will stop here. The main idea is to implement a Relay graph visitor (&lt;a href=&quot;https://github.com/apache/incubat [...]
+
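+&lt;p&gt;To give a flavor of that visitor pattern without diving into the C++ details, below is a minimal Python sketch of the same idea. The class name and the statements it emits are hypothetical, and argument/shape handling is omitted; the real codegen is implemented in C++ on top of the visitor linked above.&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from tvm.relay.expr_functor import ExprVisitor
+
+class DNNLCodegenSketch(ExprVisitor):
+    # Walk a partitioned Relay function and emit one C statement per op.
+    def __init__(self):
+        super().__init__()
+        self.stmts = []   # generated C statements, in topological order
+        self.buf_id = 0   # running id for intermediate buffers
+
+    def visit_call(self, call):
+        # Visit the arguments first so producers are emitted before consumers.
+        for arg in call.args:
+            self.visit(arg)
+        out = &quot;buf_%d&quot; % self.buf_id
+        self.buf_id += 1
+        # Map the Relay op to its pre-implemented DNNL function,
+        # e.g. nn.conv2d maps to dnnl_conv2d.
+        func = call.op.name.replace(&quot;nn.&quot;, &quot;dnnl_&quot;)
+        self.stmts.append(&quot;%s(..., %s);&quot; % (func, out))
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+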
+&lt;h3 id=&quot;c-source-compilation&quot;&gt;C Source Compilation&lt;/h3&gt;
+&lt;p&gt;As you may have noticed, the output of &lt;code class=&quot;highlighter-rouge&quot;&gt;DNNLCompiler&lt;/code&gt; is a module with the generated C code in text form, which has not yet been compiled by &lt;code class=&quot;highlighter-rouge&quot;&gt;gcc&lt;/code&gt; into an executable binary. In fact, the generated C code is compiled when users call &lt;code class=&quot;highlighter-rouge&quot;&gt;export_library(mod)&lt;/code&gt;, as in the following code snippet:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;update_lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
+    &lt;span class=&quot;c1&quot;&gt;# Include the path of src/runtime/contrib/dnnl/dnnl.cc
+&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;test_dir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dirname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/spa [...]
+    &lt;span class=&quot;n&quot;&gt;source_dir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;test_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &l [...]
+    &lt;span class=&quot;n&quot;&gt;contrib_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt [...]
+
+    &lt;span class=&quot;c1&quot;&gt;# Setup the gcc flag to compile DNNL code.
+&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;options&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;-O2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;-std=c++14&quot;&lt;/span&gt;&lt;span clas [...]
+    &lt;span class=&quot;n&quot;&gt;tmp_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;util&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tempdir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;lib_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'lib.so'&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;lib_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp_path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relpath&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+
+    &lt;span class=&quot;c1&quot;&gt;# The generated C code with DNNL APIs is compiled to a binary lib.so.
+&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;export_library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fcompile&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quo [...]
+
+    &lt;span class=&quot;c1&quot;&gt;# Load the lib.so back to a runtime module.
+&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;
+
+&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PassContext&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;opt_level&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt [...]
+    &lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;spa [...]
+&lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;update_lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;rt_mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;contrib&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;graph_runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; [...]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;h2 id=&quot;bring-dnnl-to-tvm-build-tvm-with-dnnl-codegenruntime&quot;&gt;Bring DNNL to TVM: Build TVM with DNNL Codegen/Runtime&lt;/h2&gt;
+&lt;p&gt;Finally, we create &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/cmake/modules/contrib/DNNL.cmake&quot;&gt;cmake/modules/contrib/DNNL.cmake&lt;/a&gt; to include the DNNL codegen when building TVM. For demonstration purposes, our DNNL codegen has two implementations in the same cmake file. You can focus on just one of them based on your needs.&lt;/p&gt;
+
+&lt;p&gt;With the cmake file ready, users can now specify &lt;code class=&quot;highlighter-rouge&quot;&gt;set(USE_DNNL_CODEGEN ON)&lt;/code&gt; in their &lt;code class=&quot;highlighter-rouge&quot;&gt;build/config.cmake&lt;/code&gt; to enable the DNNL codegen.&lt;/p&gt;
+
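+&lt;p&gt;Once TVM is rebuilt with this flag, a quick sanity check from Python is to look up the codegen’s global function (a sketch, assuming it is registered under the conventional &lt;code class=&quot;highlighter-rouge&quot;&gt;relay.ext.dnnl&lt;/code&gt; name):&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import tvm
+
+# If the DNNL codegen was compiled in, it registers a global function
+# named &quot;relay.ext.dnnl&quot;; get_global_func returns None otherwise.
+codegen = tvm.get_global_func(&quot;relay.ext.dnnl&quot;, True)
+print(&quot;DNNL codegen enabled:&quot;, codegen is not None)
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+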
+&lt;hr /&gt;
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;a href=&quot;https://github.com/zhiics&quot;&gt;Zhi Chen&lt;/a&gt; is a TVM PMC member as well as a senior engineer at SageMaker Neo, Amazon AI, AWS.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;a href=&quot;https://comaniac.github.io&quot;&gt;Cody Yu&lt;/a&gt; is a TVM reviewer as well as an applied scientist at Amazon AI, AWS.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;acknowledgment&quot;&gt;Acknowledgment&lt;/h2&gt;
+
+&lt;p&gt;We would like to thank our colleague Animesh Jain for valuable discussions on the framework design; Tianqi Chen and Jared Roesch from OctoML for system design discussions and prototyping; and Masahiro Masuda from the TVM community for helping review the code and improve the DNNL integration. We would also like to thank Ramana Radhakrishnan, Matthew Barrett, Manupa Karunaratne, and Luke Hutton from ARM, U.K. for contributing several helpful ideas, related Relay passes, and the Arm Compute Li [...]
+
+</description>
+                <link>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</link>
+                <guid>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</guid>
+                <pubDate>Wed, 15 Jul 2020 00:00:00 -0700</pubDate>
+        </item>
+
+        <item>
                 <title>Bridging PyTorch and TVM</title>
                 <description>
 &lt;p&gt;(A more code-heavy variant is crossposted on the more PyTorch affine &lt;a href=&quot;https://lernapparat.de/transformers-pytorch-tvm/&quot;&gt;Lernapparat&lt;/a&gt;,
@@ -3896,13 +4375,13 @@ We are starting to look at performance optimization and we expect more improveme
 &lt;p&gt;You should see something like this:&lt;/p&gt;
 
 &lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-llvm&quot; data-lang=&quot;llvm&quot;&gt;&lt;span class=&quot;c1&quot;&gt;; ModuleID = 'myadd__kernel0'&lt;/span&gt;
-&lt;span class=&quot;err&quot;&gt;sour&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;e_filename&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;myadd__kernel0&quot;&lt;/span&gt;
+&lt;span class=&quot;err&quot;&gt;source_filename&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;myadd__kernel0&quot;&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;target&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;datalayout&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;e-p:32:32-p1:64:64-p2:64:64-p3:32:32-p4:64:64-p5:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64&quot;&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;target&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;triple&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;amdgcn-amd-amdhsa-hcc&quot;&lt;/span&gt;
 
 
 &lt;span class=&quot;c1&quot;&gt;; Function Attrs: nounwind&lt;/span&gt;
-&lt;span class=&quot;k&quot;&gt;define&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;dllexport&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;amdgpu_ker&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;ne&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;l&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;vg&quot;&gt;@myadd__kernel0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k [...]
+&lt;span class=&quot;k&quot;&gt;define&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;dllexport&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;amdgpu_kernel&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;vg&quot;&gt;@myadd__kernel0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;addrspace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class [...]
 &lt;span class=&quot;nl&quot;&gt;entry:&lt;/span&gt;
   &lt;span class=&quot;nv&quot;&gt;%4&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;tail&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;vg&quot;&gt;@llvm.amdgcn.workgroup.id.x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
   &lt;span class=&quot;nv&quot;&gt;%5&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;tail&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;vg&quot;&gt;@llvm.amdgcn.workitem.id.x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
@@ -3922,14 +4401,14 @@ We are starting to look at performance optimization and we expect more improveme
   &lt;span class=&quot;nv&quot;&gt;%10&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;nsw&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%.pre-phi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%5&lt;/span&gt;
   &lt;span class=&quot;nv&quot;&gt;%11&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;nsw&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%.pre-phi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%5&lt;/span&gt;
   &lt;span class=&quot;nv&quot;&gt;%12&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sext&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%11&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i64&lt;/span&gt;
-  &lt;span class=&quot;nv&quot;&gt;%13&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;getelementptr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;inbounds&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;rspa&lt;/span&gt;&lt;span class=&quot;k&quot;&gt [...]
-  &lt;span class=&quot;nv&quot;&gt;%14&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;load&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;rspa&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;e&lt;/span&gt; [...]
-  &lt;span class=&quot;nv&quot;&gt;%15&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;getelementptr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;inbounds&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;rspa&lt;/span&gt;&lt;span class=&quot;k&quot;&gt [...]
-  &lt;span class=&quot;nv&quot;&gt;%16&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;load&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;rspa&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;e&lt;/span&gt; [...]
+  &lt;span class=&quot;nv&quot;&gt;%13&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;getelementptr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;inbounds&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;addrspace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&g [...]
+  &lt;span class=&quot;nv&quot;&gt;%14&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;load&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;addrspace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)*&lt;/span&gt; [...]
+  &lt;span class=&quot;nv&quot;&gt;%15&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;getelementptr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;inbounds&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;addrspace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&g [...]
+  &lt;span class=&quot;nv&quot;&gt;%16&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;load&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;addrspace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)*&lt;/span&gt; [...]
   &lt;span class=&quot;nv&quot;&gt;%17&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;fadd&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%16&lt;/span&gt;
   &lt;span class=&quot;nv&quot;&gt;%18&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sext&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i32&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%10&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;i64&lt;/span&gt;
-  &lt;span class=&quot;nv&quot;&gt;%19&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;getelementptr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;inbounds&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;rspa&lt;/span&gt;&lt;span class=&quot;k&quot;&gt [...]
-  &lt;span class=&quot;k&quot;&gt;store&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%17&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;rspa&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; [...]
+  &lt;span class=&quot;nv&quot;&gt;%19&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;getelementptr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;inbounds&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;addrspace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&g [...]
+  &lt;span class=&quot;k&quot;&gt;store&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%17&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;addrspace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)*&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%19&lt;/span [...]
   &lt;span class=&quot;k&quot;&gt;br&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;label&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;%if_end&lt;/span&gt;
 
 
@@ -4099,585 +4578,6 @@ We also learns from Halide when implementing the lowering pipeline in TVM.&lt;/l
                 <pubDate>Fri, 06 Oct 2017 08:30:00 -0700</pubDate>
         </item>
 
-        <item>
-                <title>Optimize Deep Learning GPU Operators with TVM: A Depthwise Convolution Example</title>
-                <description>&lt;p&gt;Efficient deep learning operators are at the core of deep learning systems.
-Usually these operators are hard to optimize and require great effort from HPC experts.
-&lt;a href=&quot;https://github.com/dmlc/tvm&quot;&gt;TVM&lt;/a&gt;, an end-to-end tensor IR/DSL stack, makes this much easier.&lt;/p&gt;
-
-&lt;p&gt;This blog teaches you how to write high-performance GPU operator kernels with the help of TVM.
-We use depthwise convolution (i.e. &lt;a href=&quot;http://docs.tvmlang.org/api/python/topi.html#topi.nn.depthwise_conv2d_nchw&quot;&gt;topi.nn.depthwise_conv2d_nchw&lt;/a&gt;) as an example,
-and demonstrate how we can improve over the already hand-optimized CUDA kernel in TensorFlow.
-Our final version is 2x-4x faster than the optimized kernel in tf-1.2 under different workloads, and 3x-7x faster with operator fusion enabled.
-Below is the result tested on GTX1080, with filter size = [1, 256, 3, 3], stride = [1, 1], padding = ‘SAME’:&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/depthconv_tutorial/tf_compare.png&quot; alt=&quot;image&quot; width=&quot;95%&quot; /&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;introduction-to-depthwise-convolution&quot;&gt;Introduction to Depthwise Convolution&lt;/h2&gt;
-
-&lt;p&gt;Depthwise convolution is an important building block of modern architectures, such as Xception [1] and MobileNet [2].
-It’s an effective method to reduce the computational complexity of deep neural networks.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/depthconv_tutorial/conv_and_depthconv.png&quot; alt=&quot;image&quot; width=&quot;80%&quot; /&gt;&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;source: &lt;a href=&quot;http://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/&quot;&gt;http://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/&lt;/a&gt;&lt;/p&gt;
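-
-&lt;p&gt;As a back-of-the-envelope illustration of that reduction (the shapes below are ours, purely for illustration), compare the multiply-accumulate count of a standard 3x3 convolution with that of a depthwise 3x3 convolution followed by a 1x1 pointwise convolution:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Rough multiply-accumulate counts for one 3x3 layer on a 32x32
-# feature map with 256 input and 256 output channels.
-H = W = 32
-C_in = C_out = 256
-K = 3
-
-standard  = H * W * C_out * C_in * K * K   # standard convolution
-depthwise = H * W * C_in * K * K           # depthwise convolution
-pointwise = H * W * C_out * C_in           # 1x1 pointwise convolution
-print(standard / (depthwise + pointwise))  # ~8.7x fewer MACs
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;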
-
-&lt;p&gt;In TVM, depthwise convolution can be declared as:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# padding stage
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PaddedInput&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
-    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;in_channel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;height_after_pad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;width_after_pad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
-    &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=& [...]
-        &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;all&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_top&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;  [...]
-        &lt;span class=&quot;n&quot;&gt;Input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pad_top&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span  [...]
-    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PaddedInput&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-&lt;span class=&quot;c1&quot;&gt;# depthconv stage
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;di&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reduce_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;filter_height&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),& [...]
-&lt;span class=&quot;n&quot;&gt;dj&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reduce_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;filter_width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; & [...]
-&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
-    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out_channel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out_height&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out_width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
-    &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=& [...]
-        &lt;span class=&quot;n&quot;&gt;PaddedInput&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;channel_multiplier&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/spa [...]
-        &lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;di&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]),&lt;/span&gt;
-    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'DepthwiseConv2d'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;h2 id=&quot;general-gpu-optimization-guidelines&quot;&gt;General GPU Optimization Guidelines&lt;/h2&gt;
-
-&lt;p&gt;This part briefly talks about three concepts we should know when optimizing CUDA code: data reuse, shared memory and bank conflicts.
-If you are already familiar with them, feel free to skip this part.&lt;/p&gt;
-
-&lt;h3 id=&quot;data-reuse&quot;&gt;Data Reuse&lt;/h3&gt;
-&lt;p&gt;In modern computing architectures, the cost of loading data from memory is much higher than doing a single floating point computation [3].
-Because of that, we always want to reuse the input data after it has been loaded into registers or shared memory (cache).&lt;/p&gt;
-
-&lt;p&gt;There are two forms of data reuse in depthwise convolution: filter reuse and input reuse. Filter reuse happens as the filter slides over the input channel and computes multiple times.
-Input reuse is realized through tiling; let’s take a 3x3 depthwise conv as an example:&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/depthconv_tutorial/no_tiling.png&quot; alt=&quot;image&quot; width=&quot;70%&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Without tiling, each thread computes 1 output element and loads 3x3 input data. 16 threads together have 9x16 loads.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/depthconv_tutorial/tiling.png&quot; alt=&quot;image&quot; width=&quot;70%&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;With tiling, each thread computes 2x2 output elements and loads 4x4 input data. 4 threads together have 16x4 loads.&lt;/p&gt;
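-
-&lt;p&gt;The two figures above come from simple counting; here is the same arithmetic as a quick sanity check (covering the same 4x4 region of output):&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Without tiling: 16 threads, each computing one output element
-# and loading its own 3x3 input patch.
-loads_no_tiling = 16 * (3 * 3)         # 144 loads
-# With tiling: 4 threads, each computing a 2x2 output tile
-# and loading a 4x4 input patch.
-loads_tiling = 4 * (4 * 4)             # 64 loads
-print(loads_no_tiling / loads_tiling)  # tiling cuts input loads by 2.25x
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;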
-
-&lt;h3 id=&quot;shared-memory-and-bank-conflicts&quot;&gt;Shared Memory and Bank Conflicts&lt;/h3&gt;
-&lt;p&gt;Shared memory can be seen as a cache on the GPU. It is on-chip and much faster than global memory.&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/depthconv_tutorial/GPU_memory_hierarchy.png&quot; alt=&quot;image&quot; width=&quot;256px&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Shared memory is allocated per block. It’s common practice to load data from global memory into shared memory, and then all threads in the block read data from shared memory.&lt;/p&gt;
-
-&lt;p&gt;The size of shared memory is limited (usually 48KB), so we must be careful not to overflow it.
-Besides, allocating too much shared memory to one block limits the number of active blocks per multiprocessor.&lt;/p&gt;
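-
-&lt;p&gt;As an illustrative (not measured) calculation of that limit, suppose each block caches one 32x32 float input tile plus one 3x3 float filter in shared memory:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;smem_budget = 48 * 1024               # shared memory per multiprocessor (bytes)
-input_tile  = 32 * 32 * 4             # one 32x32 float tile
-filter_buf  = 3 * 3 * 4               # one 3x3 float filter
-smem_per_block = input_tile + filter_buf
-print(smem_budget // smem_per_block)  # at most 11 such blocks can be resident
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;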
-
-&lt;p&gt;Another performance issue with shared memory is bank conflicts. Shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously;
-however, if multiple threads access the same memory bank (a bank conflict), the accesses are serialized, decreasing the effective bandwidth.&lt;/p&gt;
-
-&lt;p&gt;Shared memory banks are organized such that successive addresses are assigned to successive banks.
-To avoid bank conflicts, it’s better that successive threads access successive memory addresses, as illustrated below (each color represents one shared memory bank):&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/depthconv_tutorial/bank_conflicts.png&quot; alt=&quot;image&quot; width=&quot;95%&quot; /&gt;&lt;/p&gt;
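-
-&lt;p&gt;Concretely, with the usual 32-bank, 4-byte-word mapping (a sketch; the exact mapping depends on the architecture), the bank of an address is a simple modulo:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def bank(addr):
-    return (addr // 4) % 32   # 32 banks, one 4-byte word per bank
-
-# Successive threads reading successive floats hit distinct banks:
-print([bank(4 * t) for t in range(4)])    # [0, 1, 2, 3], conflict free
-# A stride of 32 floats maps every thread to the same bank:
-print([bank(128 * t) for t in range(4)])  # [0, 0, 0, 0], 4-way conflict
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;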
-
-&lt;p&gt;For more details on shared memory and bank conflicts, please refer to &lt;a href=&quot;https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/&quot;&gt;this Nvidia’s blog&lt;/a&gt;.&lt;/p&gt;
-
-&lt;p&gt;Ok, now let’s start optimizing depthwise convolution in TVM.&lt;/p&gt;
-
-&lt;h2 id=&quot;schedule-optimization&quot;&gt;Schedule Optimization&lt;/h2&gt;
-
-&lt;h3 id=&quot;compute-paddedinput-inline-to-save-memory-allocation&quot;&gt;Compute PaddedInput Inline to Save Memory Allocation&lt;/h3&gt;
-&lt;p&gt;As we see from part 1, padding is declared explicitly as a separate stage. We compute it inline to avoid redundant memory allocation:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create_schedule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/sp [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PaddedInput&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compute_inline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;h3 id=&quot;divide-one-large-channel-into-smaller-blocks&quot;&gt;Divide One Large Channel into Smaller Blocks&lt;/h3&gt;
-&lt;p&gt;One straightforward schedule for depthwise convolution is to let one cuda block take care of one input channel and its corresponding filters, loading them into shared memory and then computing:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;IS&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cache_read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PaddedInput&lt;/spa [...]
-&lt;span class=&quot;n&quot;&gt;FS&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cache_read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;shared&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span& [...]
-&lt;span class=&quot;n&quot;&gt;block_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;blockIdx.y&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;block_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;blockIdx.x&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-&lt;span class=&quot;c1&quot;&gt;# bind the dimension of batch (N in NCHW) with block_y
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op&lt;/span&gt;&lt;s [...]
-&lt;span class=&quot;c1&quot;&gt;# bind the dimension of channel (C in NCHW) with block_x
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op&lt;/span&gt;&lt;s [...]
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;We test the average time cost of 1000 runs on a GTX 1080, and compare it with &lt;a href=&quot;https://www.tensorflow.org/versions/r0.12/api_docs/python/nn/convolution#depthwise_conv2d&quot;&gt;depthwise_conv2d in tensorflow&lt;/a&gt;.
-Here is the result:&lt;/p&gt;
-
-&lt;table&gt;
-  &lt;thead&gt;
-    &lt;tr&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Input&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Filter&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;stride&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;tf-1.2 SAME pad (us)&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;TVM SAME pad (us)&lt;/th&gt;
-    &lt;/tr&gt;
-  &lt;/thead&gt;
-  &lt;tbody&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 21, 21]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 1, 3, 3]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;16.1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;9.1&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 32, 32]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 1, 3, 3]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;34.8&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;14.5&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 64, 64]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 1, 3, 3]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;130.9&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;98.9&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 1, 3, 3]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;251.6&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;387.4&lt;/td&gt;
-    &lt;/tr&gt;
-  &lt;/tbody&gt;
-&lt;/table&gt;
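-
-&lt;p&gt;For reference, such numbers can be collected with TVM’s time evaluator. Below is a minimal sketch, assuming &lt;code class=&quot;highlighter-rouge&quot;&gt;s&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;Input&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;Filter&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;Output&lt;/code&gt; are the schedule and tensors declared earlier:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np
-import tvm
-
-# Shapes here match the first row of the table: [1, 256, 21, 21].
-func = tvm.build(s, [Input, Filter, Output], &quot;cuda&quot;)
-ctx = tvm.gpu(0)
-a = tvm.nd.array(np.random.uniform(size=(1, 256, 21, 21)).astype(&quot;float32&quot;), ctx)
-w = tvm.nd.array(np.random.uniform(size=(256, 1, 3, 3)).astype(&quot;float32&quot;), ctx)
-b = tvm.nd.array(np.zeros((1, 256, 21, 21), dtype=&quot;float32&quot;), ctx)
-
-# time_evaluator averages the kernel time over 1000 runs.
-timer = func.time_evaluator(func.entry_name, ctx, number=1000)
-print(timer(a, w, b).mean * 1e6, &quot;us&quot;)
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;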
-
-&lt;p&gt;As we can see, this schedule performs well with small channel sizes like 21 x 21 or 32 x 32; however, its performance drops severely once the channel size grows beyond 64 x 64.
-One main reason is that allocating too much shared memory to one block limits the number of active blocks per multiprocessor.&lt;/p&gt;
-
-&lt;p&gt;We modify the schedule to divide one large channel into smaller blocks. For example, one channel (64 x 64 or 96 x 96) is divided into blocks of 32 x 32,
-and one cuda block takes care of one 32 x 32 block:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;blocking_h&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;blocking_w&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;
-&lt;span class=&quot;c1&quot;&gt;# split the dimension of height (H in NCHW)
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bx1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;s [...]
-&lt;span class=&quot;c1&quot;&gt;# split the dimension of width (W in NCHW)
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bx2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;s [...]
-&lt;span class=&quot;c1&quot;&gt;# assign one 32 x 32 block to one cuda block
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fuse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;block_y&lt;/span&gt;&lt;span class=& [...]
-&lt;span class=&quot;n&quot;&gt;bx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fuse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bx1&lt;/span&gt;&lt;span class=&quo [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;block_x&lt;/span&gt;&lt;span class=& [...]
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;Here is the new result:&lt;/p&gt;
-
-&lt;table&gt;
-  &lt;thead&gt;
-    &lt;tr&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Input&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;[blocking_h, blocking_w]&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;tf-1.2 SAME pad (us)&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;TVM SAME pad (us)&lt;/th&gt;
-    &lt;/tr&gt;
-  &lt;/thead&gt;
-  &lt;tbody&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 64, 64]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[32, 32]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;130.9&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;63.4&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[32, 32]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;251.6&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;132.5&lt;/td&gt;
-    &lt;/tr&gt;
-  &lt;/tbody&gt;
-&lt;/table&gt;
-
-&lt;p&gt;Our blocking strategy works! For 64 x 64 channel size, it brings 1.6x acceleration (98.9us -&amp;gt; 63.4us); for 96 x 96 channel size, it brings 2.9x acceleration (387.4us -&amp;gt; 132.5us).&lt;/p&gt;
-
-&lt;h3 id=&quot;tuning-parameters-of-thread-numbers&quot;&gt;Tuning Parameters of Thread Numbers&lt;/h3&gt;
-
-&lt;p&gt;How should we schedule the workload, say 32x32, among the threads of one cuda block? Intuitively, it should look like this:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;thread_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span [...]
-&lt;span class=&quot;n&quot;&gt;thread_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span [...]
-&lt;span class=&quot;n&quot;&gt;ty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&qu [...]
-&lt;span class=&quot;n&quot;&gt;tx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&qu [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reorder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tx&lt;/span&gt;&lt;span class=&qu [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread_y&lt;/span&gt;&lt;span class= [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread_x&lt;/span&gt;&lt;span class= [...]
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;There are two parameters in the schedule: &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_y&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_x&lt;/code&gt;. How do we determine their optimal combination?
-Let’s first run some experiments. Below are the results with Filter = [256, 1, 3, 3] and stride = [1, 1]:&lt;/p&gt;
-
-&lt;table&gt;
-  &lt;thead&gt;
-    &lt;tr&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Case&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Input&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;num_thread_y&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;num_thread_x&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;TVM SAME pad (us)&lt;/th&gt;
-    &lt;/tr&gt;
-  &lt;/thead&gt;
-  &lt;tbody&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 32, 32]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;8&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;32&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;9.7&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;2&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 32, 32]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;4&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;32&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;8.8&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;3&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 32, 32]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;32&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;17.7&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;4&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 32, 32]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;32&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;32.5&lt;/td&gt;
-    &lt;/tr&gt;
-  &lt;/tbody&gt;
-&lt;/table&gt;
-
-&lt;p&gt;There are several interesting observations in the above results:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;
-    &lt;p&gt;Case 2 is faster than case 1. In case 2, each thread computes an 8x1 tile of the output, which corresponds to a 10x3 tile of the input.
-This gives better data reuse than case 1’s 4x1 tile (see the sketch after this list).&lt;/p&gt;
-  &lt;/li&gt;
-  &lt;li&gt;
-    &lt;p&gt;Case 3 is slower than case 2. This is because in case 3 the workload per thread is too large, which incurs a high cost of local memory reads.&lt;/p&gt;
-  &lt;/li&gt;
-  &lt;li&gt;
-    &lt;p&gt;Case 4 is slower than case 3. This is because &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_x = 32&lt;/code&gt; ensures no bank conflicts, while &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_y = 32&lt;/code&gt; doesn’t.&lt;/p&gt;
-  &lt;/li&gt;
-&lt;/ul&gt;
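-
-&lt;p&gt;The data-reuse arithmetic behind the first observation is easy to check: with a 3x3 filter and stride 1, an output tile of h x w reads an input tile of (h+2) x (w+2). Below is a plain-Python sketch of the reuse ratio (illustrative only; the helper is ours, not part of TVM):&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def reuse_ratio(h, w, k=3):
-    # multiply-accumulates performed per input element loaded
-    loads = (h + k - 1) * (w + k - 1)   # size of the input tile read
-    return (h * w * k * k) / loads
-
-print(reuse_ratio(8, 1))  # case 2: 8x1 output tile reads 10x3 input -&amp;gt; 2.4
-print(reuse_ratio(4, 1))  # case 1: 4x1 output tile reads  6x3 input -&amp;gt; 2.0
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;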
-
-&lt;p&gt;To summarize what we learn from the above observations:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;A large tile is good for data reuse but bad for local memory reads.&lt;/li&gt;
-  &lt;li&gt;The influence of &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_y&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_x&lt;/code&gt; on bank conflicts is asymmetric.&lt;/li&gt;
-  &lt;li&gt;Finding the optimal combination of &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_y&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_x&lt;/code&gt; means balancing efficient shared memory access (avoiding bank conflicts), data reuse, and local memory reads.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;Pretty tricky. So what exactly should we do to find the optimal combination? The answer is brute-force search.
-We can pass &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_y&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_x&lt;/code&gt; as arguments to the schedule function and try all possible combinations to find the optimal one. This is easy to do in TVM:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;schedule_depthwise_conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;&lt;span class=&quot; [...]
-    &lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;
-    &lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt;
-    &lt;span class=&quot;n&quot;&gt;do_schedule_as_usual&lt;/span&gt;
-    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;schedule&lt;/span&gt;
-
-&lt;span class=&quot;n&quot;&gt;min_time_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inf&lt;/span&gt;
-&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;all_possible_combinations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
-    &lt;span class=&quot;n&quot;&gt;schedule&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;schedule_depthwise_conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;&lt;span class=& [...]
-    &lt;span class=&quot;n&quot;&gt;time_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;test_depthwise_conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;schedule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min_time_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
-        &lt;span class=&quot;n&quot;&gt;min_time_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time_cost&lt;/span&gt;
-        &lt;span class=&quot;n&quot;&gt;optimal_combination&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;In fact, this can be seen as a simple auto-scheduler.&lt;/p&gt;
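-
-&lt;p&gt;For completeness, here is one way the &lt;code class=&quot;highlighter-rouge&quot;&gt;test_depthwise_conv2d&lt;/code&gt; step above could be implemented with TVM’s built-in evaluator. This is a minimal sketch, assuming &lt;code class=&quot;highlighter-rouge&quot;&gt;Input&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;Filter&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;Output&lt;/code&gt; are the tensors declared earlier and the corresponding NumPy arrays already exist:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;func = tvm.build(schedule, [Input, Filter, Output], target=&quot;cuda&quot;)
-ctx = tvm.gpu(0)
-input_nd  = tvm.nd.array(input_np, ctx)   # input_np etc. are assumed to exist
-filter_nd = tvm.nd.array(filter_np, ctx)
-output_nd = tvm.nd.array(output_np, ctx)
-# run the kernel repeatedly and take the mean cost in seconds
-timer = func.time_evaluator(func.entry_name, ctx, number=100)
-time_cost = timer(input_nd, filter_nd, output_nd).mean
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;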
-
-&lt;h3 id=&quot;vthread-and-strided-patterns&quot;&gt;Vthread and Strided Patterns&lt;/h3&gt;
-&lt;p&gt;Vthread (virtual thread) in TVM is introduced to support strided patterns. We can use it this way:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;num_vthread_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;num_vthread_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;thread_vy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_vthread_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/sp [...]
-&lt;span class=&quot;n&quot;&gt;thread_vx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_vthread_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/sp [...]
-&lt;span class=&quot;n&quot;&gt;thread_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span [...]
-&lt;span class=&quot;n&quot;&gt;thread_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thread_axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_thread_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span [...]
-&lt;span class=&quot;c1&quot;&gt;# split the dimension of height (H in NCHW) twice
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vyi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt [...]
-&lt;span class=&quot;n&quot;&gt;ty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;yi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&qu [...]
-&lt;span class=&quot;c1&quot;&gt;# split the dimension of width (W in NCHW) twice
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vxi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt [...]
-&lt;span class=&quot;n&quot;&gt;tx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&qu [...]
-&lt;span class=&quot;c1&quot;&gt;# bind thread and vthread respectively
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread_vy&lt;/span&gt; [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread_vx&lt;/span&gt;&lt;span clas [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread_y&lt;/span&gt;&lt;span class= [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread_x&lt;/span&gt;&lt;span class= [...]
-&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reorder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvx&lt;/span&gt;&lt;span class=& [...]
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;Let’s print the IR to see what vthread does:&lt;/p&gt;
-
-&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cm&quot;&gt;/* Input = [1, 1, 32, 32], Filter = [1, 1, 3, 3], stride = [1, 1], padding = 'SAME' */&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;produce&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(blockIdx.y, , blockIdx.y)] thread_extent = 1&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(threadIdx.y, Range(min=0, extent=8), threadIdx.y)] thread_extent = 8&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(threadIdx.x, Range(min=0, extent=8), threadIdx.x)] thread_extent = 8&lt;/span&gt;
-  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class [...]
-    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span cla [...]
-      &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)& [...]
-      &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;) [...]
-      &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;) [...]
-      &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;) [...]
-      &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;di&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-          &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&g [...]
-          &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;& [...]
-          &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;& [...]
-          &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;& [...]
-        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;Without vthread (just set to 1), the IR is:&lt;/p&gt;
-
-&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cm&quot;&gt;/* Input = [1, 1, 32, 32], Filter = [1, 1, 3, 3], stride = [1, 1], padding = 'SAME' */&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;produce&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(blockIdx.y, , blockIdx.y)] thread_extent = 1&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(threadIdx.y, Range(min=0, extent=8), threadIdx.y)] thread_extent = 8&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(threadIdx.x, Range(min=0, extent=8), threadIdx.x)] thread_extent = 8&lt;/span&gt;
-  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class [...]
-    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span cla [...]
-      &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)& [...]
-      &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;di&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-          &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&g [...]
-        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;As we can see, when &lt;code class=&quot;highlighter-rouge&quot;&gt;num_vthread_y = 2&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;num_vthread_x = 2&lt;/code&gt;, the 32 x 32 channel is divided into four sub-channels of 16 x 16.
-Each thread computes four output elements at a time, one in each sub-channel.&lt;/p&gt;
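-
-&lt;p&gt;To make this concrete, the sketch below (plain Python, illustrative only; the exact split order follows the schedule above) lists the four output coordinates a single thread touches in one step, assuming vthreads index the 16x16 sub-channels:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;H = W = 32
-num_vthread = 2              # along each of y and x
-sub = H // num_vthread       # 16: side length of one sub-channel
-
-def elements_of(oy, ox):
-    # one element per sub-channel, strided by the sub-channel size
-    return [(vy * sub + oy, vx * sub + ox)
-            for vy in range(num_vthread)
-            for vx in range(num_vthread)]
-
-elements_of(0, 0)  # -&amp;gt; [(0, 0), (0, 16), (16, 0), (16, 16)]
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;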
-
-&lt;p&gt;Below are the results with Filter = [256, 1, 3, 3], stride = [1, 1], blocking_h = 32, blocking_w = 32:&lt;/p&gt;
-
-&lt;style&gt;
-table th:nth-of-type(1) {
-    width: 120px;
-}
-table th:nth-of-type(2) {
-    width: 120px;
-}
-&lt;/style&gt;
-
-&lt;table&gt;
-  &lt;thead&gt;
-    &lt;tr&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Case&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Input&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;num_thread_y, num_thread_x&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;num_vthread_y, num_vthread_x&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;TVM SAME pad (us)&lt;/th&gt;
-    &lt;/tr&gt;
-  &lt;/thead&gt;
-  &lt;tbody&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;8, 8&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1, 1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;132.5&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;2&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;8, 8&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1, 4&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;103.1&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;3&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;4, 32&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1, 1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;95.9&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;4&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;8, 16&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1, 2&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;90.9&lt;/td&gt;
-    &lt;/tr&gt;
-  &lt;/tbody&gt;
-&lt;/table&gt;
-
-&lt;p&gt;Case 2 is faster than case 1. This is because in case 2, &lt;code class=&quot;highlighter-rouge&quot;&gt;num_thread_x=8&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;num_vthread_x=4&lt;/code&gt; together ensure that consecutive threads access consecutive memory addresses,
-thus avoiding bank conflicts, as illustrated below (each color represents one thread’s workload):&lt;/p&gt;
-
-&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/depthconv_tutorial/vthread_and_strided_pattern.png&quot; alt=&quot;image&quot; width=&quot;90%&quot; /&gt;&lt;/p&gt;
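-
-&lt;p&gt;In other words, thread &lt;code class=&quot;highlighter-rouge&quot;&gt;tx&lt;/code&gt; inside virtual thread &lt;code class=&quot;highlighter-rouge&quot;&gt;vx&lt;/code&gt; touches column &lt;code class=&quot;highlighter-rouge&quot;&gt;vx * 8 + tx&lt;/code&gt;, so for a fixed &lt;code class=&quot;highlighter-rouge&quot;&gt;vx&lt;/code&gt; the eight threads hit eight consecutive addresses. A quick sketch of the access pattern, assuming this mapping matches the figure:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;num_thread_x, num_vthread_x = 8, 4
-for vx in range(num_vthread_x):    # virtual threads are unrolled at compile time
-    cols = [vx * num_thread_x + tx for tx in range(num_thread_x)]
-    # eight consecutive addresses per group -&amp;gt; conflict-free access
-    print(vx, cols)                # e.g. vx=0 -&amp;gt; [0, 1, 2, 3, 4, 5, 6, 7]
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;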
-
-&lt;p&gt;In theory, cases 3 and 4 should be equally fast, since they have the same workload per thread and both enjoy efficient shared memory access; in practice, case 4 is just a little faster.&lt;/p&gt;
-
-&lt;p&gt;Remember TensorFlow’s speed? It’s 251.6us, and TVM is now 2.8x faster. In the progression 387.4 -&amp;gt; 132.5 -&amp;gt; 95.9 -&amp;gt; 90.9, blocking helps the most; tuning thread numbers saves 37us;
-and vthread saves an additional 5us.&lt;/p&gt;
-
-&lt;p&gt;In fact, TVM can be far faster than TensorFlow with a large kernel size or channel_multiplier (because of more filter reuse):&lt;/p&gt;
-
-&lt;table&gt;
-  &lt;thead&gt;
-    &lt;tr&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Input&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;Filter&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;stride&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;tf-1.2 SAME pad (us)&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;TVM SAME pad (us)&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;TVM speedup&lt;/th&gt;
-    &lt;/tr&gt;
-  &lt;/thead&gt;
-  &lt;tbody&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 1, 3, 3]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;251.6&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;90.9&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;2.8x&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 1, 5, 5]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;597.6&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;128.9&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;4.6x&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 2, 3, 3]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;659.9&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;143.7&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;4.6x&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 256, 96, 96]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[256, 2, 5, 5]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;[1, 1]&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1203.9&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;170.5&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;7.1x&lt;/td&gt;
-    &lt;/tr&gt;
-  &lt;/tbody&gt;
-&lt;/table&gt;
-
-&lt;h2 id=&quot;operator-fusion&quot;&gt;Operator Fusion&lt;/h2&gt;
-
-&lt;p&gt;One typical optimization in deep learning is operator fusion, which computes multiple operators together in a single kernel without saving intermediate results back to global memory.
-TVM supports this out of the box.&lt;/p&gt;
-
-&lt;p&gt;Consider a common pattern in neural networks: &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d&lt;/code&gt; + &lt;code class=&quot;highlighter-rouge&quot;&gt;scale_shift&lt;/code&gt; + &lt;code class=&quot;highlighter-rouge&quot;&gt;relu&lt;/code&gt;. We can fuse these three operators into one by slightly modifying the original schedule:&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;topi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;depthwise_c [...]
-&lt;span class=&quot;n&quot;&gt;ScaleShift&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;topi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scale_shift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/s [...]
-&lt;span class=&quot;n&quot;&gt;Relu&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;topi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ScaleShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
-
-&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Relu&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# is no longer DepthwiseConv2d
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ScaleShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compute_inline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# this line fuses ScaleShift, explicitly
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_scope&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;local&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&qu [...]
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;schedule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# schedule for Output the same way we schedule for DepthwiseConv2d as discussed above
-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compute_at&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Output&lt; [...]
-&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;It generates IR like this:&lt;/p&gt;
-
-&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cm&quot;&gt;/* Input = [1, 1, 32, 32], Filter = [1, 1, 3, 3], stride = [1, 1], padding = 'SAME' */&lt;/span&gt;
-&lt;span class=&quot;n&quot;&gt;produce&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Relu&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(blockIdx.y, , blockIdx.y)] thread_extent = 1&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [DepthwiseConv2d] storage_scope = &quot;local&quot;&lt;/span&gt;
-  &lt;span class=&quot;n&quot;&gt;allocate&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float32&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/s [...]
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(threadIdx.y, Range(min=0, extent=8), threadIdx.y)] thread_extent = 8&lt;/span&gt;
-  &lt;span class=&quot;c1&quot;&gt;// attr [iter_var(threadIdx.x, Range(min=0, extent=8), threadIdx.x)] thread_extent = 8&lt;/span&gt;
-  &lt;span class=&quot;n&quot;&gt;produce&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-      &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-        &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &l [...]
-        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;di&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-          &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
-            &lt;span class=&quot;n&quot;&gt;DepthwiseConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt [...]
-          &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span clas [...]
-    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span cl [...]
-      &lt;span class=&quot;n&quot;&gt;Relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[((((((((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;blockIdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt [...]
-    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
-
-&lt;p&gt;As we can see, each thread computes &lt;code class=&quot;highlighter-rouge&quot;&gt;scale_shift&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;relu&lt;/code&gt; before writing the result of &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d&lt;/code&gt; to global memory. The fused operator is as fast as a single &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d&lt;/code&gt;.
-Below are the results with Input = [1, 256, 96, 96], Filter = [256, 1, 3, 3], stride = [1, 1], padding = ‘SAME’:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;tf-1.2 &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d&lt;/code&gt;: 251.6 us&lt;/li&gt;
-  &lt;li&gt;tf-1.2 &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d&lt;/code&gt; + &lt;code class=&quot;highlighter-rouge&quot;&gt;scale_shift&lt;/code&gt; + &lt;code class=&quot;highlighter-rouge&quot;&gt;relu&lt;/code&gt; (separate): 419.9 us&lt;/li&gt;
-  &lt;li&gt;TVM &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d&lt;/code&gt;: 90.9 us&lt;/li&gt;
-  &lt;li&gt;TVM &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d + scale_shift + relu&lt;/code&gt; (fused): 91.5 us&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;The advantage of operator fusion is obvious.&lt;/p&gt;
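-
-&lt;p&gt;Whenever operators are fused, it is worth checking numerical correctness against an unfused reference. A hedged sketch in NumPy (the array names and the &lt;code class=&quot;highlighter-rouge&quot;&gt;depthwise_conv2d_numpy&lt;/code&gt; helper are hypothetical, not part of TVM):&lt;/p&gt;
-
-&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np
-
-# reference: run the three stages separately on the host
-conv_np = depthwise_conv2d_numpy(input_np, filter_np)  # hypothetical helper
-ref_np = np.maximum(conv_np * scale_np.reshape(1, -1, 1, 1)
-                    + shift_np.reshape(1, -1, 1, 1), 0)
-np.testing.assert_allclose(output_nd.asnumpy(), ref_np, rtol=1e-5)
-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;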
-
-&lt;p&gt;This is not the end; TVM can do operator fusion in an even smarter way. You may refer to &lt;a href=&quot;https://github.com/dmlc/tvm/issues/215&quot;&gt;this issue&lt;/a&gt; and read the source code provided below.&lt;/p&gt;
-
-&lt;h2 id=&quot;show-me-the-code&quot;&gt;Show me the code&lt;/h2&gt;
-&lt;ul&gt;
-  &lt;li&gt;Declare: &lt;a href=&quot;https://github.com/dmlc/tvm/blob/master/topi/python/topi/nn/depthwise_conv2d.py&quot;&gt;https://github.com/dmlc/tvm/blob/master/topi/python/topi/nn/depthwise_conv2d.py&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;Schedule: &lt;a href=&quot;https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/depthwise_conv2d.py&quot;&gt;https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/depthwise_conv2d.py&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;Test: &lt;a href=&quot;https://github.com/dmlc/tvm/blob/master/topi/recipe/conv/depthwise_conv2d_test.py&quot;&gt;https://github.com/dmlc/tvm/blob/master/topi/recipe/conv/depthwise_conv2d_test.py&lt;/a&gt;&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;
-&lt;p&gt;The author thanks Tianqi Chen for his helpful advice and inspiring discussions.&lt;/p&gt;
-
-&lt;h2 id=&quot;bio&quot;&gt;Bio&lt;/h2&gt;
-&lt;p&gt;&lt;a href=&quot;https://Huyuwei.github.io&quot;&gt;Yuwei Hu&lt;/a&gt; is an intern in &lt;a href=&quot;http://tusimple.ai/&quot;&gt;Tusimple&lt;/a&gt;’s HPC group.
-He is taking a gap year after obtaining a bachelor’s degree in electrical engineering from Beihang University.&lt;/p&gt;
-
-&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
-&lt;p&gt;[1] &lt;a href=&quot;https://arxiv.org/abs/1610.02357&quot;&gt;Xception: Deep Learning with Depthwise Separable Convolutions&lt;/a&gt;&lt;/p&gt;
-
-&lt;p&gt;[2] &lt;a href=&quot;https://arxiv.org/abs/1704.04861&quot;&gt;MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications&lt;/a&gt;&lt;/p&gt;
-
-&lt;p&gt;[3] &lt;a href=&quot;http://norvig.com/21-days.html#answers&quot;&gt;Approximate timing for various operations on a typical PC&lt;/a&gt;&lt;/p&gt;
-</description>
-                <link>https://tvm.apache.org/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example</link>
-                <guid>https://tvm.apache.org/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example</guid>
-                <pubDate>Tue, 22 Aug 2017 00:00:00 -0700</pubDate>
-        </item>
-
 
 </channel>
 </rss>
diff --git a/sitemap.txt b/sitemap.txt
index 9ca86cb..fb2dc63 100644
--- a/sitemap.txt
+++ b/sitemap.txt
@@ -12,6 +12,7 @@ https://tvm.apache.org/sitemap.txt
 https://tvm.apache.org/tags
 https://tvm.apache.org/vta
 
+https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm
 https://tvm.apache.org/2020/07/14/bert-pytorch-tvm
 https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny
 https://tvm.apache.org/2020/05/20/bring-your-own-datatypes