Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2022/02/02 01:02:24 UTC

[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r797193262



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with the TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator embedded in our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open software suite for our customers by integrating and utilizing TVM,
+  bringing TVM's capability and experience to them.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what the Marvell ML/AI inference accelerator does best, a given pre-trained network model
+is run through the TVM-Mrvl-BYOC AOT compilation and code-gen flow illustrated in the steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load the pre-trained network into a TVM IR graph
+
+* Apply Marvell-specific layout conversions to transform the IR graph so that it meets the requirements of the accelerator
+
+* Apply Marvell-specific composite-merging/fusing to transform the IR graph so that the HW capability available
+  in the accelerator is utilized
+
+* Apply additional Marvell-specific transform pass(es) to further optimize the IR graph
+
+* Partition the IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs (a pattern-table and partitioning sketch follows this list)
+    * These subgraphs cover the whole pre-trained network
+    * A for-accelerator Mrvl subgraph (let's call this sub-graph A) contains connected, composite-fused Call nodes
+      of the given IR graph. A composite-merged Call node can, for instance, be fused from this sequence of IR call
+      nodes: conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of the given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
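+
+The actual composite patterns and partitioning flow live in the python/tvm/relay/op/contrib/mrvl.py file (see the
+  Reference-level section below). The snippet here is only a minimal sketch, assuming the generic TVM BYOC passes
+  (MergeComposite, AnnotateTarget, MergeCompilerRegions, PartitionGraph) and a hypothetical conv2d + add + relu
+  pattern; it is not the Marvell implementation.
+
+```
+# Minimal sketch (not the actual mrvl.py code) of how a BYOC integration
+# typically registers a composite pattern and partitions a Relay graph.
+# The "mrvl" compiler name and the conv2d+add+relu pattern are illustrative.
+import tvm
+from tvm import relay
+from tvm.relay import transform
+from tvm.relay.dataflow_pattern import is_op, wildcard
+from tvm.relay.op.contrib.register import register_pattern_table
+
+
+def conv2d_add_relu_pattern():
+    # Match conv2d -> add -> relu so the three ops become one composite Call node.
+    pat = is_op("nn.conv2d")(wildcard(), wildcard())
+    pat = is_op("add")(pat, wildcard())
+    return is_op("nn.relu")(pat)
+
+
+@register_pattern_table("mrvl")
+def mrvl_pattern_table():
+    return [("mrvl.conv2d_add_relu", conv2d_add_relu_pattern())]
+
+
+def partition_sketch(mod, params=None):
+    # Hypothetical helper mirroring the generic BYOC pass pipeline.
+    if params:
+        mod["main"] = relay.build_module.bind_params_by_name(mod["main"], params)
+    seq = tvm.transform.Sequential(
+        [
+            transform.MergeComposite(mrvl_pattern_table()),
+            transform.AnnotateTarget("mrvl"),
+            transform.MergeCompilerRegions(),
+            transform.PartitionGraph(),
+        ]
+    )
+    return seq(mod)
+```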
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; it reads in the
+  Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data and generates the final
+  instructions in a model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  part of the upstreamed code
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI
+  tarball; it reads in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: the Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC,
+  we focus only on float16 inference runs
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between the sub-graph(s) in order to run inference for the given pre-trained network.
+  Note: the runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular
+  TVM LLVM build (see the sketch after this list)
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
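+
+Below is a minimal sketch, assuming the mod_non_mrvl_subgraph and params objects from the partitioning step, of how
+  the for-TVM-target non-Mrvl subgraph could be compiled with the regular TVM LLVM build and run with the standard
+  graph executor; the input index, data, and shape are placeholders only.
+
+```
+# Sketch: build the non-Mrvl subgraph for the host CPU and run it.
+# mod_non_mrvl_subgraph and params are assumed to come from partitioning.
+import numpy as np
+import tvm
+from tvm import relay
+from tvm.contrib import graph_executor
+
+lib = relay.build(mod_non_mrvl_subgraph, target="llvm", params=params)
+dev = tvm.cpu(0)
+runtime = graph_executor.GraphModule(lib["default"](dev))
+
+# Feed the tensor handed over from the Mrvl subgraph (placeholder data and
+# shape here), then read back the final network output.
+runtime.set_input(0, np.zeros((1, 10), dtype="float32"))
+runtime.run()
+output = runtime.get_output(0).numpy()
+```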
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using an MNIST model
+
+Let's use the Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
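+
+For reference, a runnable Keras version of the pseudo layers above might look like the sketch below. Flatten stands
+  in for the unparameterized Reshape, and Keras defaults to channels-last input, hence the (28, 28, 1) shape; these
+  are assumptions for illustration, not part of the Marvell flow itself.
+
+```
+# Runnable sketch of the pseudo model above; layer sizes copied from the pseudo code.
+from tensorflow import keras
+
+model = keras.Sequential(
+    [
+        keras.Input(shape=(28, 28, 1)),  # channels-last, Keras default
+        keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu"),
+        keras.layers.MaxPooling2D(pool_size=(2, 2)),
+        keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu"),
+        keras.layers.MaxPooling2D(pool_size=(2, 2)),
+        keras.layers.Dropout(0.3),
+        keras.layers.Flatten(),  # stands in for the Reshape() above
+        keras.layers.Dense(256, activation="relu"),
+        keras.layers.Dense(10),
+    ]
+)
+model.compile(
+    optimizer="adam",
+    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+    metrics=["accuracy"],
+)
+```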
+
+We can train the above MNIST fashion model using the train_images dataset below and save
+  the pre-trained model in ONNX format (say, mnist_fashion.onnx); a training and ONNX-export sketch follows the
+  snippet below. Then, we can run the BYOC Marvell flow on any image of the orig_test_images[i] dataset to get its
+  inferred fashion label and item name in top_label_id and fashion_label_dictionary[top_label_id], respectively.
+  In addition, we can use the corresponding golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
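+
+One way to train the model and save it as mnist_fashion.onnx is sketched below, assuming the `model` from the Keras
+  sketch above and the train_images/train_labels arrays just loaded; tf2onnx and the preprocessing details are
+  assumptions for illustration, not requirements of this RFC.
+
+```
+# Sketch: train the Keras model above and export it to ONNX via tf2onnx.
+import numpy as np
+import tensorflow as tf
+import tf2onnx
+
+# Add the channel dimension and scale pixels to [0, 1].
+x_train = train_images[..., np.newaxis].astype("float32") / 255.0
+model.fit(x_train, train_labels, batch_size=128, epochs=5, validation_split=0.1)
+
+# Export with batch size 1 to match the [1, 1, 28, 28]-style single-image input
+# used in this RFC; the input name here is illustrative only.
+spec = (tf.TensorSpec((1, 28, 28, 1), tf.float32, name="input"),)
+tf2onnx.convert.from_keras(model, input_signature=spec, output_path="mnist_fashion.onnx")
+```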
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in the pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate the TVM mod and params.
+  They are then passed as function arguments to the aot_build_and_json_codegen() API in order to generate the
+  Nodes-JSON file (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
+
+* Note: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in the Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return the Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        mod,
+        params,
+        model_name="mnist_fashion",
+        working_dir="mnist",
+    )
+
+
+    # from infrastructure.py: pseudo code for what the aot_build_and_json_codegen() function above does internally
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such strategy (the default strategy), under which,
+  for this specific sample MNIST model, the entire network model is turned into one Mrvl subgraph and
+  no non-Mrvl subgraph.
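+
+To see how a model was cut, the Relay text of the modules can simply be printed and compared with the IR listings
+  below; a sketch, assuming the objects returned by the pseudo code above:
+
+```
+# Sketch: print the Relay text of the original and partitioned modules.
+print(orig_mod.astext(show_meta_data=False))
+print(mod_mrvl_subgraph.astext(show_meta_data=False))
+print(mod_non_mrvl_subgraph.astext(show_meta_data=False))
+```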
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We arrive at the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the MrvlIRGraphUtils class is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for the given IR graph
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;

Review comment:
       We have not spent time on the TIR flow and passes - we will.
   One quick question: can a TIR buffer and its data layout determine how the inputs/outputs of Marvell sub-graphs and LLVM non-Marvell sub-graphs are communicated during inference runtime?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org