You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/01/16 17:22:06 UTC
[GitHub] [incubator-tvm] tmoreau89 commented on a change in pull request #4718: [Docs] Bring Your Own Codegen Guide -- Part 2

tmoreau89 commented on a change in pull request #4718: [Docs] Bring Your Own Codegen Guide -- Part 2
URL: https://github.com/apache/incubator-tvm/pull/4718#discussion_r367542596
 
 

 ##########
 File path: docs/dev/relay_bring_your_own_codegen.rst
 ##########
 @@ -0,0 +1,932 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+=============================
+Bring Your Own Codegen To TVM
+=============================
+**Author**: `Zhi Chen <https://github.com/zhiics>`_, `Cody Hao Yu <https:://github.com/comaniac>`_
+
+As the number of hardware devices targeted by deep learning workloads keeps increasing, the required knowledge for users to achieve high performance on various devices keeps increasing as well. To free data scientists from worrying about the performance when developing a new model, hardware backend providers either provide libraries such as MKLDNN or cuDNN with many commonly used deep learning operators, or provide frameworks such as TensorRT to let users describe their models in a certain way to achieve high performance. However, users have to learn a new programming interface when they attempt to work on a new library or device. As a result, the demand for a unified programming interface becomes more and more important to 1) let all users and hardware backend providers stand on the same page, and 2) provide a feasible solution to allow specialized hardware or library to only support widely used operators with extremely high performance, but fallback unsupported operators to general devices like CPU/GPU.
+
+In this developer guide, we demonstrate how you, as a hardware backend provider, can easily implement your own codegen and register it as a Relay backend compiler to support your hardware device/library. This guide covers two types of codegen based on different graph representations you need:
+
+**1. You want to generate C code.**
+
+If your hardware already has a well-optimized C/C++ library, such as Intel CBLAS/MKL to CPU and NVIDIA CUBLAS to GPU, then this is what you are looking for. Fortunately, C source code module is fully compatible with TVM runtime module, which means the generated code could be compiled by any C/C++ compiler with proper compilation flags, so the only task you have is to implement a codegen that generates C code for subgraphs and a C source module to integrate into TVM runtime module. We will demonstrate how to implement a C code generator for your hardware in the following section.
+
+**2. You want to generate any other graph representations.**
+
+Your hardware may require other forms of graph representation, such as JSON. In this case, you need to implement not only a codegen but also a customized TVM runtime module to let TVM runtime know how this graph representation should be executed. If you already have a complete graph execution engine for your hardware, such as TensorRT for GPU, then this is a solution you can consider.
+
+After you finish the codegen and runtime, you can then let your customers annotate their models with your customized tag to make use of them. The tutorial for end-users to annotate and launch a specific codegen is **here (TBA)**.
+
+*********************
+Implement a C Codegen
+*********************
+
+In this part, we demonstrate how to implement a codegen that generates C code with pre-implemented operator functions. To simplify, our example codegen does not depend on third-party libraries. Instead, we manually implement two macros in C:
+
+.. code-block:: c++
+
+    #define CSOURCE_BINARY_OP_1D(p_ID_, p_OP_, p_DIM1_)         \
+        extern "C" void p_ID_(float* a, float* b, float* out) { \
+            for (int64_t i = 0; i < p_DIM1_; ++i) {             \
+                out[i] = a[i] p_OP_ b[i];                       \
+            }                                                   \
+        }
+
+    #define CSOURCE_BINARY_OP_2D(p_ID_, p_OP_, p_DIM1_, p_DIM2_)  \
+        extern "C" void p_ID_(float* a, float* b, float* out) {   \
+            for (int64_t i = 0; i < p_DIM1_; ++i) {               \
+                for (int64_t j = 0; j < p_DIM2_; ++j) {           \
+                    int64_t k = i * p_DIM2_ + j;                  \
+                    out[k] = a[k] p_OP_ b[k];                     \
+                }                                                 \
+            }                                                     \
+        }
+
+With the two macros, we can generate binary operators for 1-D and 2-D tensors. For example, given a subgraph as follows. Assuming all inputs are 2-D tensors with shape (10, 10).
+
+::
+
+    c_compiler_input0
+           |
+          add <-- c_compiler_input1
+           |
+        subtract <-- c_compiler_input2
+           |
+        multiply <-- c_compiler_input3
+           |
+          out
+
+Our goal is to generate the following compilable code to execute the subgraph:
+
+.. code-block:: c++
+
+    #include <tvm/runtime/c_runtime_api.h>
+    #include <tvm/runtime/packed_func.h>
+    #include <dlpack/dlpack.h>
+    #include <cstdint>
+    #include <cstring>
+    #include <iostream>
+
+    #define GCC_BINARY_OP_1D(p_ID_, p_OP_, p_DIM1_)           \
+      extern "C" void p_ID_(float* a, float* b, float* out) { \
+        for (int64_t i = 0; i < p_DIM1_; ++i) {               \
+          out[i] = a[i] p_OP_ b[i];                           \
+        }                                                     \
+      }
+
+    #define GCC_BINARY_OP_2D(p_ID_, p_OP_, p_DIM1_, p_DIM2_)  \
+      extern "C" void p_ID_(float* a, float* b, float* out) { \
+        for (int64_t i = 0; i < p_DIM1_; ++i) {               \
+          for (int64_t j = 0; j < p_DIM2_; ++j) {             \
+            int64_t k = i * p_DIM2_ + j;                      \
+            out[k] = a[k] p_OP_ b[k];                         \
+          }                                                   \
+        }                                                     \
+      }
+
+    // Note 1
+    GCC_BINARY_OP_2D(gcc_0_0, *, 10, 10);
+    GCC_BINARY_OP_2D(gcc_0_1, -, 10, 10);
+    GCC_BINARY_OP_2D(gcc_0_2, +, 10, 10);
+
+    // Note 2
+    extern "C" void gcc_0_(float* gcc_input0, float* gcc_input1,
+                           float* gcc_input2, float* gcc_input3, float* out) {
+      float* buf_0 = (float*)malloc(4 * 100);
+      float* buf_1 = (float*)malloc(4 * 100);
+      gcc_0_2(gcc_input0, gcc_input1, buf_0);
+      gcc_0_1(buf_0, gcc_input2, buf_1);
+      gcc_0_0(buf_1, gcc_input3, out);
+      free(buf_0);
+      free(buf_1);
+    }
+
+    // Note 3
+    extern "C" int gcc_0_wrapper(DLTensor* arg0, DLTensor* arg1, DLTensor* arg2,
+                                 DLTensor* arg3, DLTensor* out) {
+      gcc_0_(static_cast<float*>(arg0->data), static_cast<float*>(arg1->data),
+             static_cast<float*>(arg2->data), static_cast<float*>(arg3->data),
+             static_cast<float*>(out->data));
+      return 0;
+    }
+    TVM_DLL_EXPORT_TYPED_FUNC(gcc_0, gcc_0_wrapper);
+
+Here we highlight the notes marked in the above code:
+
+* **Note 1** is the function implementation for the three nodes in the subgraph.
+
+* **Note 2** is a function to execute the subgraph by allocating intermediate buffers and invoking corresponding functions.
+
+* **Note 3** is a TVM runtime compatible wrapper function. It accepts a list of input tensors and one output tensor (the last argument), casts them to the right data type, and invokes the subgraph function described in Note 2. In addition, ``TVM_DLL_EXPORT_TYPED_FUNC`` is a TVM macro that generates another function ``gcc_0`` with unified the function arguments by packing all tensors to ``TVMArgs``. As a result, the TVM runtime can directly invoke ``gcc_0`` to execute the subgraph without additional efforts. With the above code generated, TVM is able to compile it along with the rest parts of the graph and export a single library for deployment.
+
+In the rest of this section, we will implement a codegen step-by-step to generate the above code. Your own codegen has to be located at ``src/relay/backend/contrib/<your-codegen-name>/``. In our example, we name our codegen "codegen_c" and put it under ``src/relay/backend/contrib/codegen_c/codegen.cc``. Feel free to check this file for a complete implementation.
+
+Specifically, we are going to implement two classes in this file and here is their relationship:
+
+::
+
+                       subgraph                                subgraph
+  TVM backend -----------------------------> CSourceCodegen -------------> CodegenC
+         ^                                       |    ^                       |
+         |                                       |    |                       |
+         ----------------------------------------      ------------------------
+            generated C source runtime module              generated C code
+
+When TVM backend finds a function (subgraph) in a Relay graph is annotated with the registered compiler tag (``ccompiler`` in this example), TVM backend invokes ``CSourceCodegen`` and passes the subgraph. ``CSourceCodegen``'s member function ``CreateCSourceModule`` will 1) generate C code for the subgraph, and 2) wrap the generated C code to a C source runtime module for TVM backend to compile and deploy. In particular, the C code generation is transparent to the ``CodegenC`` class because it provides many useful utilities to ease the code generation implementation. The following sections will implement these two classes in the bottom-up order.
 
 Review comment:
   "finds a function (subgraph) in a Relay graph is annotated with" -> "finds in a Relay graph a function (subgraph) that is annotated with..."

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services