Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/12/17 21:34:16 UTC

[GitHub] [tvm-rfcs] ccjoechou opened a new pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

ccjoechou opened a new pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48


   * Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
   
   We have also posted a pre-RFC at https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691.
   Plus, we have upstreamed our POC code changes in PR-9730 (https://github.com/apache/tvm/pull/9730). We have resolved a Mrvl.cmake issue, but we are now waiting for tips from the TVM community in order to make the PR's Jenkins task_rust.sh pass. 
   
   Note 1: we have not spent much time on driver/runtime integration and therefore may be missing changes for the Rust cargo setup. We are trying to catch up here.
   Note 2: we do run TVM-Jenkinsfile-like builds & tests locally, but we have skipped the task_rust.sh script during our local runs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787288652



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator embedded in our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to them.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what the Marvell ML/AI inference accelerator does best, a given pre-trained network model
+is run through the TVM-Mrvl-BYOC AOT compilation and code-gen flow illustrated in the steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition the IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs (a hedged partitioning sketch follows this step list)
+    * These subgraphs cover the whole pre-trained network
+    * A for-accelerator Mrvl subgraph (let's call it sub-graph A) contains connected, composite-fused Call nodes
+      of the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of the given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
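+Below is a minimal, hedged sketch (not the actual mrvl.py implementation) of how the layout-conversion,
+composite-merging, and partitioning steps above can be composed from standard Relay passes. The helper
+mrvl_pattern_table() is assumed to be provided by the proposed python/tvm/relay/op/contrib/mrvl.py file;
+its contents are not reproduced here.
+
+```
+# Illustrative only: a typical BYOC partition pipeline built from standard
+# Relay passes. mrvl_pattern_table() is an assumed helper from the proposed
+# python/tvm/relay/op/contrib/mrvl.py file.
+import tvm
+from tvm import relay
+from tvm.relay.op.contrib.mrvl import mrvl_pattern_table  # assumed helper
+
+def partition_sketch(mod, params=None):
+    if params:
+        mod["main"] = relay.build_module.bind_params_by_name(mod["main"], params)
+    seq = tvm.transform.Sequential(
+        [
+            # Marvell-specific layout conversion (NHWC activations, OHWI weights)
+            relay.transform.ConvertLayout({"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}),
+            # Composite-merge IR nodes into Marvell backend layers
+            relay.transform.MergeComposite(mrvl_pattern_table()),
+            # Mark and group the regions handled by the "mrvl" codegen, then partition
+            relay.transform.AnnotateTarget("mrvl"),
+            relay.transform.MergeCompilerRegions(),
+            relay.transform.PartitionGraph(),
+        ]
+    )
+    with tvm.transform.PassContext(opt_level=3):
+        return seq(mod)
+```
+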
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; it can be used to read
+  in the Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final
+  instructions in a model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in this upstream contribution
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI
+  tarball; it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: the Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC,
+  we will focus only on float16 inference runs
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between the sub-graph(s) in order to run inference for the given pre-trained network.
+  Note: the runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular
+  TVM LLVM build (a hedged sketch is given at the end of this section)
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
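+As a hedged sketch only (the exact integration API is not part of this RFC), an IR module targeted at TVM --
+for example the for-TVM-target non-Mrvl subgraph, or the full model when no Mrvl partitioning is applied -- can be
+compiled with the regular TVM LLVM build and run with the graph executor. The names mod_non_mrvl_subgraph and
+params below are assumed to come from the partitioning flow described in the reference-level section.
+
+```
+# Illustrative only: compile the non-Mrvl (or full) IR module with the regular
+# LLVM target and run it with the graph executor. mod_non_mrvl_subgraph and
+# params are assumed to be produced by the BYOC Marvell partitioning flow below.
+import numpy as np
+import tvm
+from tvm import relay
+from tvm.contrib import graph_executor
+
+lib = relay.build(mod_non_mrvl_subgraph, target="llvm", params=params)
+dev = tvm.cpu(0)
+runtime = graph_executor.GraphModule(lib["default"](dev))
+runtime.set_input("permute_input", np.zeros((1, 1, 28, 28), dtype="float32"))
+runtime.run()
+logits = runtime.get_output(0).numpy()  # shape (1, 10) for the MNIST example below
+```
+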
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the train_images dataset below and save
+  the pre-trained model in ONNX format (say, mnist_fashion.onnx); a hedged training-and-export sketch is given after
+  the snippet below. Then, we can run the BYOC Marvell flow by giving any image orig_test_images[i] of the test
+  dataset to get its inferred fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
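+
+As a hedged illustration only (the training script is not part of this RFC), the model can be trained with Keras
+and exported to ONNX, for example via the tf2onnx package. The layer stack follows the pseudo code above, while
+the channels-last training layout and the tf2onnx call are assumptions made for this sketch.
+
+```
+# Illustrative only: train the MNIST-Fashion model sketched above and save it
+# as mnist_fashion.onnx. tf2onnx usage and the channels-last layout are
+# assumptions for illustration, not requirements of the BYOC Marvell flow.
+import numpy as np
+import tf2onnx
+from tensorflow import keras
+
+(train_images, train_labels), (orig_test_images, golden_output_labels) = \
+    keras.datasets.fashion_mnist.load_data()
+train_images = train_images[..., np.newaxis].astype("float32") / 255.0  # (N, 28, 28, 1)
+
+model = keras.Sequential(
+    [
+        keras.Input(shape=(28, 28, 1)),
+        keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu"),
+        keras.layers.MaxPooling2D(pool_size=(2, 2)),
+        keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu"),
+        keras.layers.MaxPooling2D(pool_size=(2, 2)),
+        keras.layers.Dropout(0.3),
+        keras.layers.Flatten(),
+        keras.layers.Dense(256, activation="relu"),
+        keras.layers.Dense(10),
+    ]
+)
+model.compile(
+    optimizer="adam",
+    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+    metrics=["accuracy"],
+)
+model.fit(train_images, train_labels, epochs=1, batch_size=128)
+
+tf2onnx.convert.from_keras(model, opset=13, output_path="mnist_fashion.onnx")
+```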
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files, as well as
+  in the pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate the TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the Nodes-JSON file
+  (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
+
+* Note: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in the Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return the Marvell-specific Nodes-JSON and Constants-JSON files,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod=mod,
+        params=params,
+    )
+
+
+    # from infrastructure.py: pseudo code for the aot_build_and_json_codegen() function above
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
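+
+The two generated files are plain JSON meta-data files; as a hedged illustration only (the Marvell JSON schema
+itself is not defined by this sketch), they can be loaded and inspected like any other JSON file:
+
+```
+# Illustrative only: the Nodes-JSON meta file produced above is ordinary JSON;
+# the "nodes" key below is an assumption about its layout, not a specification.
+import json
+
+with open(nodes_json_filename, "r") as f:
+    mrvl_graph_meta = json.load(f)
+print("Marvell layer nodes:", len(mrvl_graph_meta.get("nodes", [])))
+```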
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such strategy (the default) where, for this specific
+  sample MNIST model, the entire network model is turned into one Mrvl subgraph and no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after the from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get to the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for the given IR graph
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are, conceptually speaking, composited into one Marvell layer: tvmgen_mrvl_main_3 (a hedged pattern
+      sketch follows the example below).
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
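+
+A hedged sketch of how such a composite pattern can be declared with the Relay dataflow pattern language is shown
+below; the exact patterns and check functions used in mrvl.mrvl_pattern_table() are not reproduced here.
+
+```
+# Illustrative only: a conv2d + add + relu composite pattern similar to the
+# "mrvl.conv2d_nhwc2nhwc" composite shown above, expressed with the Relay
+# dataflow pattern language.
+from tvm import relay
+from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard
+
+def conv2d_add_relu_pattern():
+    conv = is_op("nn.conv2d")(wildcard(), is_constant())
+    bias = is_op("add")(conv, is_constant())
+    return is_op("nn.relu")(bias)
+
+pattern_table = [("mrvl.conv2d_nhwc2nhwc", conv2d_add_relu_pattern())]
+# mod = relay.transform.MergeComposite(pattern_table)(mod)
+```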
+
+* Because the Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for the Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer, tvmgen_mrvl_main_0,
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we utilize the unique en_id attribute
+  stored in the class CallNode and the class Tuple (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can have composited Marvell layer(s), back to the original IR nodes (e.g., to use the original tensor
+      layouts and no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph that is recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using the original layouts with no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """Revert the given Mrvl subgraph back to an IR module in original layouts with no Marvell compositions."""
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl_subgraph.functions, mod_mrvl_subgraph.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(mrvl_layers_in_mrvl_subgraph, mod_mrvl_subgraph).visit(mod_mrvl_subgraph["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(
+        mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]})
+    )
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:

Review comment:
       Oh, the email version of this question was linking to a different RFC segment (the light green section above) -- that led me to answer differently in my in-line reply to your email.
   Sorry about that; since I can now see the correct light green block above corresponding to your question, let me reply here again properly.
   
   We are using the TVM graph executor codegen to process the Marvell part of the IR sub-graph generated by the BYOC-Marvell relay pass sequence, which includes Marvell-specific GraphInputNode object(s) & attributes and Marvell-specific GraphOpNode object(s) & attributes. When processing the Marvell sub-graph in the TVM graph executor codegen, we need to specialize the code generation in order to dump extra Marvell-specific attributes to the Nodes-JSON file (and in a more readable format). The original code can't do what we need; hence, we are using derived classes and callback functions in C++ to override the defaults here.







[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787277520



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+* We can get to the following one Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* In the above Mrvl subgraph, it is formed by "not-yet optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer: tvmgen_mrvl_main_1, conceptually speaking.
+```
+      # from original IR graphs

Review comment:
       I am not sure what you meant by “device planning pass”. We have been following what others did in tvm/python/tvm/relay/op/contrib by utilizing relay passes (for example, ConvertLayout, MergeComposite, AnnotateTarget, etc.). Please note that in this RFC we only want to generate the JSON meta files; we are not ready to propose/upstream our runtime & driver hookups yet.







[GitHub] [tvm-rfcs] areusch commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
areusch commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r789196282



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
+  in model binary file
+
+* Note: Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included

Review comment:
       cc @jroesch can we unblock their rust debugging? @ccjoechou i'm not as familiar with the rust stuff in TVM, but we should transfer ownership of any rust packages to a TVM account. apologies for any oversight there.

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run

Review comment:
       ok--suggest to either add a period or maybe reword as "For this RFC, we will focus only on models that use float16 quantization mode."

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+* We can get the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed from "not-yet optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer, tvmgen_mrvl_main_3, conceptually speaking.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for the Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer,
+    tvmgen_mrvl_main_0, corresponds to a layout_transform() operation, which takes the original input tensor in
+    src_layout="NCHW" and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
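+
+To make the layout-conversion step concrete, below is a minimal sketch (not the actual mrvl.py
+  implementation) of wrapping the ConvertLayout configuration above in a small pass sequence, assuming
+  mod is the module returned by relay.frontend.from_onnx():
+
+```
+      import tvm
+      from tvm import relay
+
+      def convert_layout_for_mrvl(mod):
+          """Apply the Marvell NHWC/OHWI layout conversion before composite merging/partitioning."""
+          seq = tvm.transform.Sequential(
+              [
+                  relay.transform.ConvertLayout(
+                      {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+                  ),
+                  relay.transform.InferType(),
+              ]
+          )
+          with tvm.transform.PassContext(opt_level=3):
+              return seq(mod)
+```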
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute

Review comment:
       Ok. I'd like to suggest, for clarity's sake, that we use exprnode_id here.

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
+  in model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in the upstreamed code
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
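+
+For reference, a runnable sketch of the model above is shown below. It is an assumption-laden illustration
+  (Keras' default channels-last input shape (28, 28, 1) is used, Flatten() stands in for the unparameterized
+  Reshape(), and the compile settings are illustrative), not the exact model used to produce mnist_fashion.onnx.
+
+```
+  from tensorflow import keras
+
+  model = keras.Sequential(
+      [
+          keras.Input(shape=(28, 28, 1)),
+          keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu"),
+          keras.layers.MaxPooling2D(pool_size=(2, 2)),
+          keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu"),
+          keras.layers.MaxPooling2D(pool_size=(2, 2)),
+          keras.layers.Dropout(0.3),
+          keras.layers.Flatten(),
+          keras.layers.Dense(256, activation="relu"),
+          keras.layers.Dense(10),
+      ]
+  )
+  model.compile(
+      optimizer="adam",
+      loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+      metrics=["accuracy"],
+  )
+```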
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
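+
+For example, validating one test image against its golden label could look like the sketch below, where
+  run_mrvl_inference() is a placeholder (an assumption, not an existing API) for whichever simulator, HW, or
+  TVM-runtime path produces the [1, 10] output tensor:
+
+```
+import numpy as np
+
+def validate_one_image(i, run_mrvl_inference):
+    # run_mrvl_inference() is a placeholder for the inference path (simulator, HW, or TVM runtime)
+    output_tensor = run_mrvl_inference(orig_test_images[i])  # shape [1, 10]
+    top_label_id = int(np.argmax(output_tensor))
+    assert top_label_id == golden_output_labels[i]
+    print(f"image {i}: {fashion_label_dictionary[top_label_id]}")
+```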
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the Nodes-JSON
+  file (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in the Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return the Marvell-specific Nodes-JSON and Constants-JSON files,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod,
+        params,
+    )
+
+
+    # from infrastructure.py: pseudo code for what the above aot_build_and_json_codegen() function does
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and
+  at most one non-Mrvl subgraph. Below we illustrate one such strategy (the default), under which, for this
+  specific sample MNIST model, the entire network is turned into one Mrvl subgraph and no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed from "not-yet optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer, tvmgen_mrvl_main_3, conceptually speaking.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for the Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer,
+    tvmgen_mrvl_main_0, corresponds to a layout_transform() operation, which takes the original input tensor in
+    src_layout="NCHW" and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute
+  stored on the class CallNode and the class Tuple (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert composited Marvell
+      layer(s) in the non-Mrvl subgraph back to their original IR nodes (e.g., using the original tensor
+      layout and no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using original layouts with no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """Revert a partitioned subgraph back to an IR graph using original layouts and no compositions."""
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl_subgraph.functions, mod_mrvl_subgraph.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(
+        mrvl_layers_in_mrvl_subgraph, mod_mrvl_subgraph
+    ).visit(mod_mrvl_subgraph["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(
+        mod_new,
+        relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}),
+    )
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
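+
+Relatedly, below is a purely hypothetical sketch of how a pass could consume a unique expression-node ID
+  (exprnode_id is the attribute name suggested in review; it is not a standard Relay attribute today), using
+  the existing ExprVisitor infrastructure:
+
+```
+from tvm.relay.expr_functor import ExprVisitor
+
+class CollectMrvlLayerIds(ExprVisitor):
+    """Collect the (proposed, hypothetical) unique IDs of CallNodes in a partitioned function."""
+
+    def __init__(self):
+        super().__init__()
+        self.layer_ids = []
+
+    def visit_call(self, call):
+        node_id = getattr(call, "exprnode_id", None)  # hypothetical attribute, per the proposal
+        if node_id is not None:
+            self.layer_ids.append(node_id)
+        super().visit_call(call)
+```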
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:
+    * Some common classes have been moved from the original src/relay/backend/graph_executor_codegen.cc file to the
+      new src/relay/backend/graph_executor_codegen.h file so that they can be shared by Marvell-specific functions
+      and derived classes defined in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+
+    * new definitions are listed below:
+```
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.h file
+    /*! \brief Node types */
+    enum GraphNodeType {
+      kGraphNop,
+      kGraphInputNode,
+      kGraphOpNode,
+      kGraphInputNodeExt,
+      kGraphOpNodeExt,
+    };
+
+    
+    class ExternalJsonWriterCB {
+     public:
+      template <class T>
+      void RegisterCB(T* const object,
+                      void (T::*const mf)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                          std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        using namespace std::placeholders;
+        callback_ = std::bind(mf, object, _1, _2, _3, _4);
+        hasCallback_ = true;
+      }
+      void RegisterCB(void (*const fun)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                        std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        callback_ = fun;
+        hasCallback_ = true;
+      }
+      void Exe(dmlc::JSONWriter* external_writer, Array<tvm::runtime::Module> mod,
+               std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+        ICHECK(hasCallback_) << "ERROR: no registered callback";
+        callback_(external_writer, mod, nodes, heads);
+      }
+      inline bool HasCallback() { return hasCallback_; }
+
+     private:
+      std::function<void(dmlc::JSONWriter*, Array<tvm::runtime::Module>, std::vector<GraphObjectPtr>,
+                         std::vector<GraphNodeRef>)>
+          callback_;
+      bool hasCallback_{false};
+    };
+
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.cc file
+    class GraphExecutorCodegen : public backend::MemoizedExprTranslator<std::vector<GraphNodeRef>> {
+     public:
+      GraphExecutorCodegen(runtime::Module* mod, const TargetMap& targets)
+          : mod_(mod), targets_(targets) {
+        // we need the following variable to be a static member of the class so we can access
+        //   its setting in the following static GetExternalJsonWriter() function; but this static
+        //   member can actually be used as a local Callback setting for "per" GraphExecutorCodegen
+        //   instantiation during each TVM build-codegen flow
+        external_json_writer_ = std::make_shared<ExternalJsonWriterCB>();
+        ICHECK(external_json_writer_);
+      }
+      static ExternalJsonWriterCB* GetExternalJsonWriter() { return external_json_writer_.get(); }
+      ....
+      LoweredOutput Codegen(IRModule mod, relay::Function func, String mod_name) {
+        ....
+
+        // if it has been registered for this GraphExecutorCodegen object, call the external JSON writer
+        if (external_json_writer_->HasCallback()) {
+          std::ostringstream external_os;
+          dmlc::JSONWriter external_writer(&external_os);
+          external_json_writer_->Exe(&external_writer, ret.external_mods, nodes_, heads_);
+          ret.external_graph_json = external_os.str();
+        }
+
+        return ret;
+      }
+    };
+
+    extern "C" ExternalJsonWriterCB* GetExternalJsonWriter() {
+      return GraphExecutorCodegen::GetExternalJsonWriter();
+    }
+
+    /////////////
+    // in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+    // Marvell-specific extensions
+    class GraphInputNodeMrvlExt : public GraphInputNode {
+        ...
+        GraphNodeType Type() const override { return kGraphInputNodeExt; }
+        void Save(dmlc::JSONWriter* writer) const override { /* extensions */ }
+    }
+
+    class GraphOpNodeMrvlExt : public GraphOpNode {
+        ...
+        GraphNodeType Type() const override { return kGraphOpNodeExt; }
+        void Load(dmlc::JSONReader* reader) override;
+        void LoadAttrs(dmlc::JSONReader* reader);
+        std::pair<std::string, GraphAttrs> GetLoadedGraphAttrs();
+    }
+
+    class MrvlExtJson {
+     public:
+      MrvlExtJson() {
+        ICHECK(!GetExternalJsonWriter()->HasCallback()) << "ERROR: has registered callback";
+        GetExternalJsonWriter()->RegisterCB(this, &MrvlExtJson::GetExternalJSON);
+      }
+      virtual ~MrvlExtJson() {}
+      void GetExternalJSON(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                           std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads);
+      void LoadExternalJsonAttrs(std::unordered_map<std::string, GraphAttrs>* external_attrs_map,
+                                 const Array<tvm::runtime::Module>& external_mods);
+    };
+```
+
+* The need to link the pre-trained model to the final Marvell backend layer - for instance, through tvm_custom:
+    * We did not include prototype code in PR-9730 but intend to provide our sample changes in another RFC and PR.
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* We haven't identified any major items that should *not* be done. Several other design decisions are deliberate
+  choices - that is, we understand there are benefits both to doing and to not doing them.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.
+
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+* We are following the existing TVM BYOC framework and example files.
+    * for example: doing IR compositions, defining our own IR passes, mixing implementations in Python/C++, etc.
+
+* We have extended graph_executor_codegen.cc and the JSON loader/saver in order to read and write out Marvell-specific
+  attributes
+
+* Currently, we haven't spent enough time to understand the tvm/rust/cargo requirements and steps. Therefore, we are
+  bypassing the tvm/Jenkinsfile's tests/scripts/task_rust.sh step. We will need help to re-enable this step.
+
+* We would like to duplicate the Jenkins environment in order to run tvm/Jenkinsfile as is, but we ran into many issues.
+  Currently, we have a TVM-like Jenkinsfile environment that only runs a subset of test suites using a modified
+  Jenkinsfile.
+
+* We have identified a need to allow a callback function to be registered when generating the Mrvl-BYOC-specific
+  Nodes-JSON file. We are trying to follow the TVM Python/CPP callback style as much as possible. But, since our
+  callback function tvm/src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc::GetExternalJSON() uses
+  non-simple argument types, we need suggestions/guidelines from the TVM community in order to make the new
+  callback code better meet TVM community requirements here.
+
+* For one Mrvl-BYOC relay transformation pass, we have identified a need to inject a (global) expr node ID for the
+  RelayExprNode class and its derived classes, Tuple and CallNode, so that during the transformation pass we can
+  uniquely identify each Tuple or CallNode object. Again, we need suggestions/guidelines from the TVM community
+  in order to know whether this is one of the best ways to achieve the Mrvl-BYOC need.
+
+* We also identified a need to maintain linkages between (operator-)information described in the original, given
+  pre-trained network model and the code-gen JSON files so that the compiler backend will be able to report user-level
+  (e.g., meaningful-to-user) messages regarding the given pre-trained network. For instance, in the
+  tvm/python/tvm/relay/frontend/onnx.py and common.py files, we can see user-level information being captured using
+  “tvm_custom” related code as in original onnx.py file for the given pre-trained network; but, in common.py, the code
+  later drops the linkage, via attrs.pop(“tvm_custom”), and does not pass the linkage onto the initial relay IR graph.
+  We have a draft solution to maintain linkages between the given pre-trained network model and its relay IR graph
+  (using expr node ID and tvm custom ID, plus, a few utility functions), but would like to know whether the TVM
+  community has any better or work-in-progress resolution.
+
+* When using TVM RPC code to exercise and run inference on a remote-hosted Mrvl ML/AI HW accelerator for the Mrvl
+  subgraph, we ran into one minor issue and have made a local TVM RPC enhancement so that, when a TVM RPC client sends
+  a file to the remote server, the TVM RPC client can know where the remote server saves the file on the remote machine.
+  Since this is not directly related to this Mrvl-BYOC PR, we will find time to contribute this enhancement back in
+  another TVM PR soon.
+
+* In order for us to generate the constants-JSON file, we must “NOT” remove external params, which were stored in

Review comment:
       It might be possible for you to do this in TIR, if you're able to leverage [tir.constant](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0022-tir-non-scalar-constants.md). You would need to use https://github.com/apache/tvm-rfcs/blob/main/rfcs/0010-target-registered-compiler-flow-customisation.md, so I'm not sure if that's appropriate here.

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
+  in model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in the upstreamed code
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the Nodes-JSON
+  file (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in the Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return the Marvell-specific Nodes-JSON and Constants-JSON files,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod,
+        params,
+    )
+
+
+    # from infrastructure.py: pseudo code for what the above aot_build_and_json_codegen() function does
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and
+  at most one non-Mrvl subgraph. Below we illustrate one such strategy (the default), under which, for this
+  specific sample MNIST model, the entire network is turned into one Mrvl subgraph and no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph

Review comment:
       You can add them to assets/ and then link similarly to https://github.com/apache/tvm-rfcs/blob/main/rfcs/0050-roadmaps.md (see the Raw source for an example of how to link it).

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
+  in model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in the upstreamed code
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the Nodes-JSON
+  file (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in the Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return the Marvell-specific Nodes-JSON and Constants-JSON files,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod,
+        params,
+    )
+
+
+    # from infrastructure.py: pseudo code for what the above aot_build_and_json_codegen() function does
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are the IR graphs of the for-accelerator Mrvl subgraph and the TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such alternative (i.e., the default strategy), in which,
+  for this specific sample MNIST model, the entire network model is turned into one Mrvl subgraph and
+  no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We arrive at the following single Mrvl subgraph by applying the default strategy.
+    * In the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph, as shown below.
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet-optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers (see the illustrative pattern sketch after the listing below).
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are, conceptually speaking, composited into one Marvell layer (tvmgen_mrvl_main_3 in the listing below).
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
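+
+The fusing pattern behind the "mrvl.conv2d_nhwc2nhwc" composite above can be expressed with TVM's relay
+  dataflow-pattern API and consumed by the MergeComposite pass. The snippet below is only an illustrative sketch of
+  one such pattern-table entry; the actual patterns and check functions in mrvl.mrvl_pattern_table() may differ.
+
+```
+# illustrative sketch of one pattern-table entry (not the exact mrvl.py code)
+from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard
+
+
+def conv2d_nhwc2nhwc_pattern():
+    """Match nn.conv2d + add (bias) + nn.relu so they can be composited into one Marvell layer."""
+    conv = is_op("nn.conv2d")(wildcard(), is_constant())
+    bias = is_op("add")(conv, is_constant())
+    return is_op("nn.relu")(bias)
+
+
+# a pattern table is a list of (composite-name, pattern) entries, e.g. for relay.transform.MergeComposite()
+example_pattern_table = [
+    ("mrvl.conv2d_nhwc2nhwc", conv2d_nhwc2nhwc_pattern()),
+]
+```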
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for the Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer, tvmgen_mrvl_main_0,
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
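+
+The ConvertLayout entry shown above is one element of the pass list applied during partitioning. As an illustrative
+  sketch (the exact pass sequence lives in partition_for_mrvl() in mrvl.py), such a layout conversion is typically
+  applied to the module like this:
+
+```
+# illustrative sketch of applying the layout-conversion pass (not the exact mrvl.py pass list)
+import tvm
+from tvm import relay
+
+desired_layouts = {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+seq = tvm.transform.Sequential(
+    [
+        relay.transform.InferType(),
+        relay.transform.ConvertLayout(desired_layouts),
+        relay.transform.FoldConstant(),
+    ]
+)
+with tvm.transform.PassContext(opt_level=3):
+    mod = seq(mod)  # mod as produced by relay.frontend.from_onnx() earlier
+```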
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we utilize the unique en_id attribute
+  stored for the class CallNode and the class Tuple (include/tvm/relay/expr.h).
+    * In mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can contain composited Marvell layer(s), back to its original IR nodes (e.g., to use the original tensor
+      layout and no compositions).
+    * In mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph that is recognized by the TVM LLVM build.
+    * In mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using the original layouts and no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend).
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """Revert the given subgraph back to an IR module similar to what the frontend produced."""
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl_subgraph.functions, mod_mrvl_subgraph.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(mrvl_layers_in_mrvl_subgraph, mod_mrvl_subgraph).visit(mod_mrvl_subgraph["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}))
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:
+    * Some common classes have been moved from the original src/relay/backend/graph_executor_codegen.cc file to the
+      new src/relay/backend/graph_executor_codegen.h file so that they can be shared by the Marvell-specific functions
+      and derived classes defined in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file.
+
+    * New definitions are listed below:
+```
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.h file
+    /*! \brief Node types */
+    enum GraphNodeType {
+      kGraphNop,
+      kGraphInputNode,
+      kGraphOpNode,
+      kGraphInputNodeExt,
+      kGraphOpNodeExt,
+    };
+
+    
+    class ExternalJsonWriterCB {
+     public:
+      template <class T>
+      void RegisterCB(T* const object,
+                      void (T::*const mf)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                          std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        using namespace std::placeholders;
+        callback_ = std::bind(mf, object, _1, _2, _3, _4);
+        hasCallback_ = true;
+      }
+      void RegisterCB(void (*const fun)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                        std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        callback_ = fun;
+        hasCallback_ = true;
+      }
+      void Exe(dmlc::JSONWriter* external_writer, Array<tvm::runtime::Module> mod,
+               std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+        ICHECK(hasCallback_) << "ERROR: no registered callback";
+        callback_(external_writer, mod, nodes, heads);
+      }
+      inline bool HasCallback() { return hasCallback_; }
+
+     private:
+      std::function<void(dmlc::JSONWriter*, Array<tvm::runtime::Module>, std::vector<GraphObjectPtr>,
+                         std::vector<GraphNodeRef>)>
+          callback_;
+      bool hasCallback_{false};
+    };
+
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.cc file
+    class GraphExecutorCodegen : public backend::MemoizedExprTranslator<std::vector<GraphNodeRef>> {
+     public:
+      GraphExecutorCodegen(runtime::Module* mod, const TargetMap& targets)
+          : mod_(mod), targets_(targets) {
+        // we need the following variable to be a static member of the class so we can access
+        //   its setting in the following static GetExternalJsonWriter() function; but this static
+        //   member can actually be used as a local Callback setting for "per" GraphExecutorCodegen
+        //   instantiation during each TVM build-codegen flow
+        external_json_writer_ = std::make_shared<ExternalJsonWriterCB>();
+        ICHECK(external_json_writer_);
+      }
+      static ExternalJsonWriterCB* GetExternalJsonWriter() { return external_json_writer_.get(); }
+      ....
+      LoweredOutput Codegen(IRModule mod, relay::Function func, String mod_name) {
+        ....
+
+        // if it has been registered for this GraphExecutorCodegen object, call the external JSON writer
+        if (external_json_writer_->HasCallback()) {
+          std::ostringstream external_os;
+          dmlc::JSONWriter external_writer(&external_os);
+          external_json_writer_->Exe(&external_writer, ret.external_mods, nodes_, heads_);
+          ret.external_graph_json = external_os.str();
+        }
+
+        return ret;
+      }
+    };
+
+    extern "C" ExternalJsonWriterCB* GetExternalJsonWriter() {
+      return GraphExecutorCodegen::GetExternalJsonWriter();
+    }
+
+    /////////////
+    // in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+    // Marvell-specific extensions
+    class GraphInputNodeMrvlExt : public GraphInputNode {
+        ...
+        GraphNodeType Type() const override { return kGraphInputNodeExt; }
+        void Save(dmlc::JSONWriter* writer) const override { /* extensions */ }
+    }
+
+    class GraphOpNodeMrvlExt : public GraphOpNode {
+        ...
+        GraphNodeType Type() const override { return kGraphOpNodeExt; }
+        void Load(dmlc::JSONReader* reader) override;
+        void LoadAttrs(dmlc::JSONReader* reader);
+        std::pair<std::string, GraphAttrs> GetLoadedGraphAttrs();
+    }
+
+    class MrvlExtJson {
+     public:
+      MrvlExtJson() {
+        ICHECK(!GetExternalJsonWriter()->HasCallback()) << "ERROR: has registered callback";
+        GetExternalJsonWriter()->RegisterCB(this, &MrvlExtJson::GetExternalJSON);
+      }
+      virtual ~MrvlExtJson() {}
+      void GetExternalJSON(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                           std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads);
+      void LoadExternalJsonAttrs(std::unordered_map<std::string, GraphAttrs>* external_attrs_map,
+                                 const Array<tvm::runtime::Module>& external_mods);
+    };
+```
+
+* The need to link information between the given pre-trained model and the final Marvell backend layers - for instance, through tvm_custom:
+    * We did not include prototype code in PR-9730 but intend to provide our sample changes in another RFC and PR.
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* We haven't identified any major items that we should *not* do. Several other design points are deliberate choices - that is,
+  we understand that there are benefits both to doing and to not doing them.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.
+
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+* We are following the existing TVM BYOC framework and example files.
+    * For example: to do IR compositions, to define our own IR passes, to mix implementations in Python/C++, etc.
+
+* We have extended graph_executor_codegen.cc and the JSON loader/saver in order to read and write out Marvell-specific
+  attributes.
+
+* Currently, we haven't spent enough time to understand the tvm/rust/cargo requirements and steps. Therefore, we are
+  bypassing the tvm/Jenkinsfile's tests/scripts/task_rust.sh step. We will need help to re-enable the step.
+
+* We would like to duplicate the Jenkins environment in order to run tvm/Jenkinsfile as is, but we ran into many issues.
+  Currently, we have a tvm-like Jenkinsfile environment that only runs a subset of test suites, using a modified
+  Jenkinsfile.
+
+* We have identified a need to allow a callback function to be registered when generating the Mrvl-BYOC-specific
+  Nodes-JSON file. We are trying to follow the TVM Python/C++ callback style as much as possible. But, since our
+  tvm/src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc::GetExternalJSON() callback function uses
+  non-simple argument types, we need suggestions/guidelines from the TVM community in order to make the
+  new callback code better meet TVM community requirements here.
+
+* For one Mrvl-BYOC relay transformation pass, we have identified a need to inject a (global) expr node ID for the
+  RelayExprNode class and its derived classes, Tuple and CallNode, so that during the transformation pass we can
+  uniquely identify each Tuple or CallNode object. Again, we need help from the TVM community to provide
+  suggestions/guidelines here in order to know whether this is one of the best ways to achieve the Mrvl-BYOC need.
+
+* We also identified a need to maintain linkages between the (operator-)information described in the original, given
+  pre-trained network model and the code-gen JSON files, so that the compiler backend will be able to report user-level
+  (e.g., meaningful-to-user) messages regarding the given pre-trained network. For instance, in the
+  tvm/python/tvm/relay/frontend/onnx.py and common.py files, we can see user-level information being captured by the
+  "tvm_custom"-related code in the original onnx.py file for the given pre-trained network; but, in common.py, the code
+  later drops the linkage, via attrs.pop("tvm_custom"), and does not pass the linkage onto the initial relay IR graph.
+  We have a draft solution for maintaining linkages between the given pre-trained network model and its relay IR graph
+  (using expr node ID and tvm custom ID, plus a few utility functions), but would like to know whether the TVM
+  community has any better or work-in-progress resolution.
+
+* When using TVM RPC code to exercise and run inference on a remote-hosted Mrvl ML/AI HW accelerator for the Mrvl
+  subgraph, we ran into one minor issue and have made a local TVM RPC enhancement so that, when a TVM RPC client sends

Review comment:
       could you try calling `tvm.rpc.server.workpath` on the RPC server? https://github.com/apache/tvm/blob/main/python/tvm/rpc/server.py#L62

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+* Marvell-specific graph executor codegen, We have defined call backs and extension functions in the following files:

Review comment:
       ok got it, i think that makes sense to me. i think the main question i have here is the mechanism by which you guys export the Marvell GraphExecutor sub-graph.

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
+  in model binary file
+
+* Note: Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included
+  to upstream
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_code() API in order to generate Nodes-JSON file
+  (nodes_json_filename) and Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod,
+        params,
+    )
+
+
+    # from infrastructure.py: pedueo code defined by the above aot_build_and_json_codegen() function
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_code()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategy can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we will illustrate one such alternative (i.e., the default strategy) so
+  that, for this specific sample MNIST model, the entire network model is turned into one Mrvl subgraph and
+  no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get to the following one Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* In the above Mrvl subgraph, it is formed by "not-yet optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer: tvmgen_mrvl_main_1, conceptually speaking.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
+
+* Because Marvell backend layer uses NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, NHWC format is used
+    for Marvell layer: tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first tvmgen_mrvl_main_0 layer
+    is corresponding to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and convert the input to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph in the layout-converted, composited/fused IR graph, we utilize the unique en_id attribute
+  stored on the class CallNode and the class Tuple (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can contain composited Marvell layer(s), back to its original IR nodes (e.g., using the original tensor
+      layout and no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph that uses the original layouts and no Marvell-specific compositions (i.e., similar to what
+      was given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """Revert the given Mrvl subgraph back to an IR module in its original, non-composited form."""
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl_subgraph.functions, mod_mrvl_subgraph.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(
+        mrvl_layers_in_mrvl_subgraph, mod_mrvl_subgraph
+    ).visit(mod_mrvl_subgraph["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(
+        mod_new,
+        relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}),
+    )
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
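+
+The MrvlSubgraphToRevert helper used above is not shown in full in this RFC. For illustration only, a minimal
+  ExprMutator-style skeleton of such a reverter could look like the sketch below (class and argument names follow
+  the snippet above; the actual implementation in mrvl.py is more involved):
+
+```
+# Illustrative sketch only -- not the actual MrvlSubgraphToRevert implementation.
+from tvm import relay
+from tvm.relay.expr_functor import ExprMutator
+
+
+class MrvlSubgraphToRevert(ExprMutator):
+    """Inline calls to partitioned Mrvl functions back into the main function body."""
+
+    def __init__(self, mrvl_layers_in_mrvl_subgraph, mod):
+        super().__init__()
+        self.mrvl_layers = set(mrvl_layers_in_mrvl_subgraph)
+        self.mod = mod
+
+    def visit_call(self, call):
+        # When the callee is a partitioned Mrvl function, substitute its body with the
+        # (rewritten) call arguments bound to its parameters, so that the DefuseOps and
+        # ConvertLayout passes above can then restore the original, non-composited IR.
+        if isinstance(call.op, relay.GlobalVar) and call.op.name_hint in self.mrvl_layers:
+            func = self.mod[call.op]
+            new_args = [self.visit(arg) for arg in call.args]
+            return relay.bind(func.body, dict(zip(func.params, new_args)))
+        return super().visit_call(call)
+```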
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:
+    * Some common classes have been moved from the original src/relay/backend/graph_executor_codegen.cc file to the
+      new src/relay/backend/graph_executor_codegen.h file so that they can be shared by Marvell-specific functions
+      and derived classes defined in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+
+    * New definitions are listed below:
+```
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.h file
+    /*! \brief Node types */
+    enum GraphNodeType {
+      kGraphNop,
+      kGraphInputNode,
+      kGraphOpNode,
+      kGraphInputNodeExt,
+      kGraphOpNodeExt,
+    };
+
+    
+    class ExternalJsonWriterCB {
+     public:
+      template <class T>
+      void RegisterCB(T* const object,
+                      void (T::*const mf)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                          std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        using namespace std::placeholders;
+        callback_ = std::bind(mf, object, _1, _2, _3, _4);
+        hasCallback_ = true;
+      }
+      void RegisterCB(void (*const fun)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                        std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        callback_ = fun;
+        hasCallback_ = true;
+      }
+      void Exe(dmlc::JSONWriter* external_writer, Array<tvm::runtime::Module> mod,
+               std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+        ICHECK(hasCallback_) << "ERROR: no registered callback";
+        callback_(external_writer, mod, nodes, heads);
+      }
+      inline bool HasCallback() { return hasCallback_; }
+
+     private:
+      std::function<void(dmlc::JSONWriter*, Array<tvm::runtime::Module>, std::vector<GraphObjectPtr>,
+                         std::vector<GraphNodeRef>)>
+          callback_;
+      bool hasCallback_{false};
+    };
+
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.cc file
+    class GraphExecutorCodegen : public backend::MemoizedExprTranslator<std::vector<GraphNodeRef>> {
+     public:
+      GraphExecutorCodegen(runtime::Module* mod, const TargetMap& targets)
+          : mod_(mod), targets_(targets) {
+        // we need the following variable to be a static member of the class so we can access
+        //   its setting in the following static GetExternalJsonWriter() function; but this static
+        //   member can actually be used as a local Callback setting for "per" GraphExecutorCodegen
+        //   instantiation during each TVM build-codegen flow
+        external_json_writer_ = std::make_shared<ExternalJsonWriterCB>();
+        ICHECK(external_json_writer_);
+      }
+      static ExternalJsonWriterCB* GetExternalJsonWriter() { return external_json_writer_.get(); }
+      ....
+      LoweredOutput Codegen(IRModule mod, relay::Function func, String mod_name) {
+        ....
+
+        // if it has been registered for this GraphExecutorCodegen object, call the external JSON writer
+        if (external_json_writer_->HasCallback()) {
+          std::ostringstream external_os;
+          dmlc::JSONWriter external_writer(&external_os);
+          external_json_writer_->Exe(&external_writer, ret.external_mods, nodes_, heads_);
+          ret.external_graph_json = external_os.str();
+        }
+
+        return ret;
+      }
+    };
+
+    extern "C" ExternalJsonWriterCB* GetExternalJsonWriter() {
+      return GraphExecutorCodegen::GetExternalJsonWriter();
+    }
+
+    /////////////
+    // in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+    // Marvell-specific extensions
+    class GraphInputNodeMrvlExt : public GraphInputNode {
+        ...
+        GraphNodeType Type() const override { return kGraphInputNodeExt; }
+        void Save(dmlc::JSONWriter* writer) const override { /* extensions */ }
+    }
+
+    class GraphOpNodeMrvlExt : public GraphOpNode {
+        ...
+        GraphNodeType Type() const override { return kGraphOpNodeExt; }
+        void Load(dmlc::JSONReader* reader) override;
+        void LoadAttrs(dmlc::JSONReader* reader);
+        std::pair<std::string, GraphAttrs> GetLoadedGraphAttrs();
+    }
+
+    class MrvlExtJson {
+     public:
+      MrvlExtJson() {
+        ICHECK(!GetExternalJsonWriter()->HasCallback()) << "ERROR: has registered callback";
+        GetExternalJsonWriter()->RegisterCB(this, &MrvlExtJson::GetExternalJSON);
+      }
+      virtual ~MrvlExtJson() {}
+      void GetExternalJSON(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                           std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads);
+      void LoadExternalJsonAttrs(std::unordered_map<std::string, GraphAttrs>* external_attrs_map,
+                                 const Array<tvm::runtime::Module>& external_mods);
+    };
+```
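+
+On the Python side, this callback path is exercised through relay.build(); below is a condensed sketch that mirrors
+  the aot_build_and_json_codegen() flow described earlier in this RFC:
+
+```
+# Condensed sketch: relay.build() triggers GraphExecutorCodegen::Codegen(), which invokes the
+# registered Marvell GetExternalJSON() callback and stores its output in external_graph_json.
+byoc_executor = relay.build(mod_mrvl_subgraph, target="llvm", mod_name=mod_name)
+byoc_const_params = byoc_executor.get_params()
+byoc_external_graph_json = byoc_executor.get_external_graph_json()
+```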
+
+* The need to link the pre-trained model to the final Marvell backend layer - for instance, through tvm_custom
+    * We did not include prototype code in PR-9730, but we intend to provide our sample changes in another RFC and PR.
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* We haven't identified any major items that we should *not* do. Several other designs are deliberate choices - that
+  is, we understand there are benefits both to doing them and to not doing them.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable the BYOC Marvell flow without impacting any TVM core features.
+
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+* We are following the existing TVM BYOC framework and example files.
+    * for example: to do IR compositions, to define our own IR passes, to mix implementations in Python/C++, etc.
+
+* We have extended graph_executor_codegen.cc and the JSON loader/saver in order to read and write out Marvell-specific
+  attributes
+
+* Currently, we haven't spent enough time to understand the tvm/rust/cargo requirements and steps. Therefore, we are
+  bypassing the tvm/Jenkinsfile's tests/scripts/task_rust.sh step. We will need help to re-enable the step.
+
+* We would like to duplicate the Jenkins environment in order to run tvm/Jenkinsfile as is, but we ran into many
+  issues. Currently, we have a tvm-like Jenkinsfile environment that only runs a subset of test suites using a
+  modified Jenkinsfile.
+
+* We have identified a need to allow a callback function to be registered when generating the Mrvl-BYOC-specific
+  Nodes-JSON file. We are trying to follow the TVM Python/C++ callback style as much as possible. But, since our
+  callback function, tvm/src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc::GetExternalJSON(), uses
+  non-simple argument types, we need help from the TVM community to provide suggestions/guidelines so that the
+  new callback code better meets TVM community requirements.
+
+* For one Mrvl-BYOC relay transformation pass, we have identified a need to inject a (global) expr node ID into the
+  RelayExprNode class and its derived classes, Tuple and CallNode, so that during the transformation pass we can
+  uniquely identify each Tuple or CallNode object. Again, we need help from the TVM community to provide
+  suggestions/guidelines here, in order to know whether this is one of the best ways to achieve the Mrvl-BYOC need.

Review comment:
       so is the idea that the exported graph contains the en_id and then someone can trace that back to an annotated Relay program? what's the procedure by which en_id could be used?

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
+* We can get to the following one Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;

Review comment:
       got it. so just to clarify--you're proposing to fuse these merge-composite IR functions at the Relay level into a single e.g. Relay `@main`? I think another strategy would be to run a TIR-only pass after scheduling. curious if that may work to accomplish the same goals? the benefit there is that you can also operate, at that time, on the logical TIR buffers.







[GitHub] [tvm-rfcs] areusch commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
areusch commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1026484630


   @ccjoechou great thanks! i'll take another look here when you're done with the updates.





[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r797197238



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.

Review comment:
       @areusch: We may not know enough yet, but we would like to learn more about the runtime.Module subject (its flow and how to add specialization). Can you provide an example, or point us to an existing suite under the tvm/tests folder, that runs an existing runtime from the tvm repo (without relying on specific USE_<vendor> flags being ON), if such an example exists?







[GitHub] [tvm-rfcs] ccjoechou commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1033058499


   > @ccjoechou hey I think you may have had a bad merge--I see a bunch of unrelated RFCs listed as changed underneath "Files changed." Could you take a look and rebase/re-merge?
   
   Let me check.





[GitHub] [tvm-rfcs] ccjoechou commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1040923708


   @areusch: No worries. I saw lots of TVM emails coming from you and others working on other important things as well. We will wait for your feedback.





[GitHub] [tvm-rfcs] ccjoechou commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1033131566


   @areusch:
   Hello, I believe I have cleaned up the PR's commits now (reset a few commits and re-added the changes).
   Sorry about that.
   Can you take a look again (my latest changes are with this commit: 8a7fd01)?
   Thanks.





[GitHub] [tvm-rfcs] ccjoechou commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1022454837


   @areusch: Thanks again for your latest responses. Don't worry about timing (we all have our real jobs to do too).
   I am going to update the RFC #48 doc based on much of your feedback in the next few days -- including adding figures by following the example you gave and adding descriptions to clarify many of the reviewers' good questions.
   Once that is done, we will reply to your latest responses individually. For changing the en_id name to exprnode_id, we will update our POC-upstreamed TVM PR accordingly first.
   Btw, in our RFC flow, we mentioned that we need to call the Relay DefuseOps pass to convert merge-composited Relay IR back to Relay IR without compositions. We found a bug in the Relay DefuseOps pass and have upstreamed a fix and a test case to TVM (see TVM PR-10069 for details).





[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787278142



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
+  in model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in the upstream contribution
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the Nodes-JSON file
+  (nodes_json_filename) and Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod=mod,
+        params=params,
+    )
+
+
+    # from infrastructure.py: pseudo code of the above aot_build_and_json_codegen() function
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such alternative (i.e., the default strategy) in which,
+  for this specific sample MNIST model, the entire network is turned into one Mrvl subgraph and
+  no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get to the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer: tvmgen_mrvl_main_1, conceptually speaking.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first tvmgen_mrvl_main_0 layer
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute

Review comment:
       en_id stands for ExprNode ID. It is an extra field that we have defined in the include/tvm/ir/expr.h file for the RelayExprNode (or just ExprNode) class.
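
   For reference, a minimal Python-side sketch of the same uniqueness idea is below: it walks a Relay function with an ExprVisitor and keys a dictionary by the Call/Tuple expression objects. This is only an illustration of why per-node IDs are needed; it is not the PR's implementation, which adds the en_id field on the C++ side.

```
# Illustration only: track a unique ID per Call/Tuple node from Python,
# without the extra C++ en_id field described above.
from tvm.relay.expr_functor import ExprVisitor


class ExprNodeIdAssigner(ExprVisitor):
    """Map each visited Call/Tuple node to a unique integer ID."""

    def __init__(self):
        super().__init__()
        self.next_id = 0
        self.node_id = {}

    def visit_call(self, call):
        self.node_id[call] = self.next_id
        self.next_id += 1
        super().visit_call(call)

    def visit_tuple(self, tup):
        self.node_id[tup] = self.next_id
        self.next_id += 1
        super().visit_tuple(tup)


# usage (assuming "mod" is a Relay IRModule):
#   assigner = ExprNodeIdAssigner()
#   assigner.visit(mod["main"])
#   print(len(assigner.node_id), "Call/Tuple nodes seen")
```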




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787275464



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
+  in model binary file
+
+* Note: Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included

Review comment:
       For this BYOC-Marvell RFC, the POC PR codebase only contains code to generate JSON meta files. We have up-streamed our test_mrvl test suite, but it only covers JSON codegen. In our next RFC, we will provide runtime & driver hookups. We are working on a Marvell backend package with the Marvell backend code-gen and a Marvell software simulator, which mimics a cycle-approximate Marvell HW accelerator. This package can become available later for external usage.
   Currently, we are having problems running TVM rust/cargo and can't find useful documentation to debug the issues – plus, tvm-build is owned by OctoML (not GitHub TVM, right?)
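
   As an aside, an illustrative shape of such a JSON-codegen-only check is sketched below; the helper name and signature follow the RFC's test_mrvl_codegen.py / infrastructure.py description, but the exact import path and return values are assumptions, not the upstreamed test code:

```
import os

from infrastructure import aot_build_and_json_codegen  # helper from the RFC's test infrastructure (path assumed)


def check_mnist_json_codegen(mod, params):
    # Run the front-end flow far enough to emit the two JSON meta files, then
    # simply confirm they exist (no backend compiler or simulator is needed).
    nodes_json, consts_json, *_rest = aot_build_and_json_codegen(
        model_name="mnist_fashion", working_dir="mnist", mod=mod, params=params
    )
    assert os.path.isfile(nodes_json) and os.path.isfile(consts_json)
```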




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] areusch commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
areusch commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1033040039


   @ccjoechou hey I think you may have had a bad merge--I see a bunch of unrelated RFCs listed as changed underneath "Files changed." Could you take a look and rebase/re-merge?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r797197944



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in the Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate the final
+  instructions in a model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in the upstream contribution
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the Nodes-JSON file
+  (nodes_json_filename) and Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod=mod,
+        params=params,
+    )
+
+
+    # from infrastructure.py: pseudo code of the above aot_build_and_json_codegen() function
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such alternative (i.e., the default strategy) in which,
+  for this specific sample MNIST model, the entire network is turned into one Mrvl subgraph and
+  no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get to the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer: tvmgen_mrvl_main_1, conceptually speaking.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first tvmgen_mrvl_main_0 layer
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute
+  stored for the class CallNode and the class Tuple (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can contain composited Marvell layer(s), back to its original IR nodes (e.g., to use the original tensor
+      layout and no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph that is recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using original layouts with no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """Revert a composited, layout-converted subgraph back to an IR graph in its original form."""
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl_subgraph.functions, mod_mrvl_subgraph.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(mrvl_layers_in_mrvl_subgraph, mod_mrvl_subgraph).visit(mod_mrvl_subgraph["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}))
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
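+
+As a usage sketch (assuming mod_new is the module returned by revert_mrvl_mod_to_orig() above, orig_params are the
+original parameters returned by partition_for_mrvl(), and the input name/shape follow the MNIST example), such a
+non-Mrvl subgraph can then go through the regular TVM LLVM build and graph executor:
+
+```
+import numpy as np
+import tvm
+from tvm import relay
+from tvm.contrib import graph_executor
+
+# Build the reverted (non-Mrvl) subgraph with the standard LLVM target.
+lib = relay.build(mod_new, target="llvm", params=orig_params)
+dev = tvm.cpu(0)
+rt_mod = graph_executor.GraphModule(lib["default"](dev))
+
+# Run one inference with a random input of the expected shape (example only).
+rt_mod.set_input("permute_input", np.random.rand(1, 1, 28, 28).astype("float32"))
+rt_mod.run()
+out = rt_mod.get_output(0).numpy()
+```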
+
+* For Marvell-specific graph executor codegen, we have defined callbacks and extension functions in the following files:
+    * Some common classes have been moved from the original src/relay/backend/graph_executor_codegen.cc file to the
+      new src/relay/backend/graph_executor_codegen.h file so that they can be shared by Marvell-specific functions
+      and derived classes defined in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+
+    * new definitions are listed below:
+```
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.h file
+    /*! \brief Node types */
+    enum GraphNodeType {
+      kGraphNop,
+      kGraphInputNode,
+      kGraphOpNode,
+      kGraphInputNodeExt,
+      kGraphOpNodeExt,
+    };
+
+    
+    class ExternalJsonWriterCB {
+     public:
+      template <class T>
+      void RegisterCB(T* const object,
+                      void (T::*const mf)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                          std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        using namespace std::placeholders;
+        callback_ = std::bind(mf, object, _1, _2, _3, _4);
+        hasCallback_ = true;
+      }
+      void RegisterCB(void (*const fun)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                        std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        callback_ = fun;
+        hasCallback_ = true;
+      }
+      void Exe(dmlc::JSONWriter* external_writer, Array<tvm::runtime::Module> mod,
+               std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+        ICHECK(hasCallback_) << "ERROR: no registered callback";
+        callback_(external_writer, mod, nodes, heads);
+      }
+      inline bool HasCallback() { return hasCallback_; }
+
+     private:
+      std::function<void(dmlc::JSONWriter*, Array<tvm::runtime::Module>, std::vector<GraphObjectPtr>,
+                         std::vector<GraphNodeRef>)>
+          callback_;
+      bool hasCallback_{false};
+    };
+
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.cc file
+    class GraphExecutorCodegen : public backend::MemoizedExprTranslator<std::vector<GraphNodeRef>> {
+     public:
+      GraphExecutorCodegen(runtime::Module* mod, const TargetMap& targets)
+          : mod_(mod), targets_(targets) {
+        // we need the following variable to be a static member of the class so we can access
+        //   its setting in the following static GetExternalJsonWriter() function; but this static
+        //   member can actually be used as a local Callback setting for "per" GraphExecutorCodegen
+        //   instantiation during each TVM build-codegen flow
+        external_json_writer_ = std::make_shared<ExternalJsonWriterCB>();
+        ICHECK(external_json_writer_);
+      }
+      static ExternalJsonWriterCB* GetExternalJsonWriter() { return external_json_writer_.get(); }
+      ....
+      LoweredOutput Codegen(IRModule mod, relay::Function func, String mod_name) {
+        ....
+
+        // if it has been registered for this GraphExecutorCodegen object, call the external JSON writer
+        if (external_json_writer_->HasCallback()) {
+          std::ostringstream external_os;
+          dmlc::JSONWriter external_writer(&external_os);
+          external_json_writer_->Exe(&external_writer, ret.external_mods, nodes_, heads_);
+          ret.external_graph_json = external_os.str();
+        }
+
+        return ret;
+      }
+    };
+
+    extern "C" ExternalJsonWriterCB* GetExternalJsonWriter() {
+      return GraphExecutorCodegen::GetExternalJsonWriter();
+    }
+
+    /////////////
+    // in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+    // Marvell-specific extensions
+    class GraphInputNodeMrvlExt : public GraphInputNode {
+        ...
+        GraphNodeType Type() const override { return kGraphInputNodeExt; }
+        void Save(dmlc::JSONWriter* writer) const override { /* extensions */ }
+    }
+
+    class GraphOpNodeMrvlExt : public GraphOpNode {
+        ...
+        GraphNodeType Type() const override { return kGraphOpNodeExt; }
+        void Load(dmlc::JSONReader* reader) override;
+        void LoadAttrs(dmlc::JSONReader* reader);
+        std::pair<std::string, GraphAttrs> GetLoadedGraphAttrs();
+    }
+
+    class MrvlExtJson {
+     public:
+      MrvlExtJson() {
+        ICHECK(!GetExternalJsonWriter()->HasCallback()) << "ERROR: has registered callback";
+        GetExternalJsonWriter()->RegisterCB(this, &MrvlExtJson::GetExternalJSON);
+      }
+      virtual ~MrvlExtJson() {}
+      void GetExternalJSON(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                           std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads);
+      void LoadExternalJsonAttrs(std::unordered_map<std::string, GraphAttrs>* external_attrs_map,
+                                 const Array<tvm::runtime::Module>& external_mods);
+    };
+```
+
+* The need to link between the pre-trained model and the final Marvell backend layer - for instance, through tvm_custom
+    * We did not include prototype code in PR-9730 but intend to provide our sample changes in another RFC and PR.
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* We haven't identified any major not-to-do items. Several other designs are deliberate choices - that is, we understand
+  that there are benefits both for doing and for not doing them.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.
+
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+* We are following the existing TVM BYOC framework and example files.
+    * for example: to do IR compositions, to define our own IR passes, to mix implementations in Python/C++, etc.
+
+* We have extended graph_executor_codegen.cc and JSON loader/saver in order to read and write out Marvell specific
+  attributes
+
+* Currently, we haven't spent enough time to understand the tvm/rust/cargo requirements and steps. Therefore, we are
+  bypassing the tvm/Jenkinsfile's tests/scripts/task_rust.sh step. We will need help to re-enable the step.
+
+* We would like to duplicate the Jenkins environment in order to run tvm/Jenkinsfile as is, but we ran into many issues.
+  Currently, we have a tvm-like Jenkinsfile environment that only runs a subset of test suites using a modified
+  Jenkinsfile.
+
+* We have identified a need to allow a callback function to be registered when generating the Mrvl-BYOC-specific
+  Nodes-JSON file. We are trying to follow the TVM Python/CPP callback style as much as possible. But, since our
+  tvm/src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc::GetExternalJSON() callback function uses
+  non-simple argument types, we need help from the TVM community to provide suggestions/guidelines in order to make
+  the new CB code meet TVM community requirements here.
+
+* For one Mrvl-BYOC relay transformation pass, we have identified a need to inject a (global) expr node ID for the
+  RelayExprNode class and its derived classes: Tuple and CallNode, so that during the transformation pass, we can
+  uniquely identify each Tuple or CallNode object. Again, we need help from TVM community to provide
+  suggestions/guidelines here in order to know whether this is one of the best ways to achieve the Mrvl-BYOC need.
+
+* We also identified a need to maintain linkages between (operator-)information described in the original, given
+  pre-trained network model and the code-gen JSON files so that the compiler backend will be able to report user-level
+  (e.g., meaningful-to-user) messages regarding the given pre-trained network. For instance, in the
+  tvm/python/tvm/relay/frontend/onnx.py and common.py files, we can see user-level information being captured using
+  “tvm_custom” related code as in original onnx.py file for the given pre-trained network; but, in common.py, the code
+  later drops the linkage, via attrs.pop(“tvm_custom”), and does not pass the linkage onto the initial relay IR graph.
+  We have a draft solution to maintain linkages between the given pre-trained network model and its relay IR graph
+  (using expr node ID and tvm custom ID, plus, a few utility functions), but would like to know whether the TVM
+  community has any better or work-in-progress resolution.
+
+* When using TVM RPC code to exercise and run inference on a remote-hosted Mrvl ML/AI HW accelerator for the Mrvl
+  subgraph, we ran into one minor issue and have made local TVM RPC enhancement so that, when a TVM RPC client sends

Review comment:
       yes and we will find time to check.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] mbs-octoml commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
mbs-octoml commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787217701



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability

Review comment:
       Hi, thanks for the RFC. My team at OctoML is looking at bringing some training features to the BYOC world (a la https://arxiv.org/pdf/2111.00655.pdf), so I'm looking at this RFC with that future in mind. Can you expand on:
    - Is the fusion using the existing MergeComposite / AnnotateTarget/ MergeCompilerRegions(maybe) / PartitionGraph sequence?
    - other than the global layout xform, which necessarily must be done before any fusion etc, are there any other xforms before the above partitioning takes place?
    - can you explain the need to limit to one kernel for each of your byoc and the default tvm? Perhaps it's an artifact of how you're later trying to capture the byoc output in json graph form? Ideally the BYOC target.ext.<your name> function could be run multiple times, the resulting runtime::Module would be accumulated in the IRModule, and the runtime::Modules later merged. Perhaps supporting that would actually be easier and would remove the at-most-one kernel limit?
    - Ideally there'd be a single entry point for 'partition for marvel', after which the regular TVM build would deal with fusion, lowering and codegen for everything that's left (ie overall model - kernels you already partitioned out). I may not be following the explanation but it seems you're proposing the driver splits things more explicitly.
    - Like @areusch  I'm a bit confused by the special handling of the graph. Perhaps it would be worth going through the tensorrt BYOC integration as a reference example since it too collects a JSON representation of the to-be-complied fused sub-graph (we invoke the TensorRT build function at runtime not compile time), but it does so on top of existing machinery. 
   
   Let me know if it would be easier to discuss this on a PR rather than here, then we could come back to here.   
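
   For reference, the "MergeComposite / AnnotateTarget / MergeCompilerRegions / PartitionGraph" sequence mentioned above looks roughly like the sketch below (the "mrvl" pattern-table name is assumed from the RFC's mrvl.py description; this is not necessarily what the PR implements):

```
import tvm
from tvm import relay
from tvm.relay.build_module import bind_params_by_name
from tvm.relay.op.contrib.register import get_pattern_table


def partition_with_standard_byoc_passes(mod, params=None):
    # Bind constants first so the pattern matcher can see them.
    if params:
        mod["main"] = bind_params_by_name(mod["main"], params)
    seq = tvm.transform.Sequential(
        [
            relay.transform.MergeComposite(get_pattern_table("mrvl")),
            relay.transform.AnnotateTarget("mrvl"),
            relay.transform.MergeCompilerRegions(),
            relay.transform.PartitionGraph(),
        ]
    )
    with tvm.transform.PassContext(opt_level=3):
        return seq(mod)
```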
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1050373729


   @areusch:
   Yes, using one or two zoom sessions to go over some questions of yours and some questions of ours will definitely be very, very helpful. This way we can sync up quicker.
   I am open tomorrow (2/25) as well as Monday (2/28) in the afternoons from 1:30 pm to 6 pm Pacific time zone.
   Will you be available?
   My company email address is: cchou1@marvell.com.
   Please feel free to schedule a zoom meeting using my email.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] areusch commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
areusch commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787106344



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; it can be used to read
+  in the Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate the final
+  instructions in a model binary file
+
+* Note: Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included

Review comment:
       what's the test plan for this RFC? Would it be possible to add the Marvell backend compiler and simulator to our ci images and run against it in CI?

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; it can be used to read
+  in the Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate the final
+  instructions in a model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in this upstream contribution
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run

Review comment:
       just checking if this was the end of the sentence here

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; it can be used to read
+  in the Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate the final
+  instructions in a model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in this upstream contribution
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
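+A minimal, illustrative sketch of the "train and save to ONNX" step above is given below (assumptions:
+  TensorFlow/Keras and the tf2onnx package are available, train_images/train_labels come from the
+  load_data() call above, and the exact architecture, layouts, and training settings are placeholders
+  rather than the settings used to produce mnist_fashion.onnx):
+
+```
+import tensorflow as tf
+from tensorflow import keras
+import tf2onnx
+
+# small MNIST-fashion classifier, sketched after the Keras layers listed earlier (channels-last here)
+model = keras.Sequential(
+    [
+        keras.Input(shape=(28, 28, 1)),
+        keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu"),
+        keras.layers.MaxPooling2D(pool_size=(2, 2)),
+        keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu"),
+        keras.layers.MaxPooling2D(pool_size=(2, 2)),
+        keras.layers.Dropout(0.3),
+        keras.layers.Flatten(),
+        keras.layers.Dense(256, activation="relu"),
+        keras.layers.Dense(10),
+    ]
+)
+model.compile(
+    optimizer="adam",
+    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+    metrics=["accuracy"],
+)
+model.fit(train_images[..., None] / 255.0, train_labels, epochs=1)
+
+# export the trained model to ONNX under the file name used in this example
+spec = (tf.TensorSpec((1, 28, 28, 1), tf.float32, name="input"),)
+tf2onnx.convert.from_keras(model, input_signature=spec, output_path="mnist_fashion.onnx")
+```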
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate Nodes-JSON file
+  (nodes_json_filename) and Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod,
+        params,
+    )
+
+
+    # from infrastructure.py: pseudo code for the aot_build_and_json_codegen() function above
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such strategy (the default), under which this specific
+  sample MNIST model is turned into one Mrvl subgraph and no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get to the following single Mrvl subgraph by applying the default strategy.
+    * In the mrvl.py file, the compute_two_subgraphs() function of the MrvlIRGraphUtils class is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph, as shown below.
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet-optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer: tvmgen_mrvl_main_1, conceptually speaking.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
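+For illustration only, below is a minimal sketch of how a composite pattern such as "mrvl.conv2d_nhwc2nhwc"
+  could be declared with the Relay dataflow-pattern helpers and handed to MergeComposite. This is not the
+  actual mrvl.py implementation; the helper name and the exact set of fused operators are assumptions.
+
+```
+from tvm import relay
+from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard
+
+
+def conv2d_nhwc2nhwc_pattern():
+    # conv2d + add (bias) + relu, matching the "nn.conv2d_add_nn.relu_" pattern shown above
+    conv = is_op("nn.conv2d")(wildcard(), is_constant())
+    bias = is_op("add")(conv, is_constant())
+    return is_op("nn.relu")(bias)
+
+
+mrvl_pattern_table = [("mrvl.conv2d_nhwc2nhwc", conv2d_nhwc2nhwc_pattern())]
+
+# `mod` stands for the Relay module produced by from_onnx() earlier;
+# MergeComposite rewrites matching sub-expressions into Composite functions.
+mod = relay.transform.MergeComposite(mrvl_pattern_table)(mod)
+```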
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer, tvmgen_mrvl_main_0,
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
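+A minimal sketch of applying this layout-conversion pass on its own is shown below (assuming `mod` is the
+  Relay module obtained from from_onnx() earlier; the pass list is illustrative and not the full mrvl.py
+  pass sequence):
+
+```
+import tvm
+from tvm import relay
+
+seq = tvm.transform.Sequential(
+    [
+        relay.transform.ConvertLayout(
+            {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+        ),
+        relay.transform.InferType(),
+    ]
+)
+with tvm.transform.PassContext(opt_level=3):
+    mod = seq(mod)
+# conv2d/max_pool2d now use NHWC, and a layout_transform() is inserted at the NCHW input.
+```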
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute

Review comment:
       could you motivate the naming of `en_id` a bit? i recognize this is a common thing, but it might be nice to choose a slightly more specific name

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; it can be used to read
+  in the Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate the final
+  instructions in a model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in this upstream contribution
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate Nodes-JSON file
+  (nodes_json_filename) and Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod,
+        params,
+    )
+
+
+    # from infrastructure.py: pseudo code for the aot_build_and_json_codegen() function above
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such strategy (the default), under which this specific
+  sample MNIST model is turned into one Mrvl subgraph and no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get to the following single Mrvl subgraph by applying the default strategy.
+    * In the mrvl.py file, the compute_two_subgraphs() function of the MrvlIRGraphUtils class is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph, as shown below.
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet-optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer: tvmgen_mrvl_main_1, conceptually speaking.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer, tvmgen_mrvl_main_0,
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute
+  stored for the CallNode class and the Tuple class (include/tvm/relay/expr.h).
+    * In mrvl.py: the class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can still contain composited Marvell layer(s), back to its original IR nodes (e.g., using the original
+      tensor layout and with no compositions)
+    * In mrvl.py: the class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph that is recognized by the TVM LLVM build.
+    * In mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using the original layouts with no Marvell-specific compositions (i.e., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """Revert the given subgraph back to an IR graph using original layouts and no compositions."""
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl_subgraph.functions, mod_mrvl_subgraph.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(mrvl_layers_in_mrvl_subgraph, mod_mrvl_subgraph).visit(mod_mrvl_subgraph["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}))
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:
+    * Some common classes have been moved from the original src/relay/backend/graph_executor_codegen.cc file to the
+      new src/relay/backend/graph_executor_codegen.h file so that they can be shared by Marvell-specific functions
+      and derived classes defined in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+
+    * new definitions are listed below:
+```
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.h file
+    /*! \brief Node types */
+    enum GraphNodeType {
+      kGraphNop,
+      kGraphInputNode,
+      kGraphOpNode,
+      kGraphInputNodeExt,
+      kGraphOpNodeExt,
+    };
+
+    
+    class ExternalJsonWriterCB {
+     public:
+      template <class T>
+      void RegisterCB(T* const object,
+                      void (T::*const mf)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                          std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        using namespace std::placeholders;
+        callback_ = std::bind(mf, object, _1, _2, _3, _4);
+        hasCallback_ = true;
+      }
+      void RegisterCB(void (*const fun)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                        std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        callback_ = fun;
+        hasCallback_ = true;
+      }
+      void Exe(dmlc::JSONWriter* external_writer, Array<tvm::runtime::Module> mod,
+               std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+        ICHECK(hasCallback_) << "ERROR: no registered callback";
+        callback_(external_writer, mod, nodes, heads);
+      }
+      inline bool HasCallback() { return hasCallback_; }
+
+     private:
+      std::function<void(dmlc::JSONWriter*, Array<tvm::runtime::Module>, std::vector<GraphObjectPtr>,
+                         std::vector<GraphNodeRef>)>
+          callback_;
+      bool hasCallback_{false};
+    };
+
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.cc file
+    class GraphExecutorCodegen : public backend::MemoizedExprTranslator<std::vector<GraphNodeRef>> {
+     public:
+      GraphExecutorCodegen(runtime::Module* mod, const TargetMap& targets)
+          : mod_(mod), targets_(targets) {
+        // we need the following variable to be a static member of the class so we can access
+        //   its setting in the following static GetExternalJsonWriter() function; but this static
+        //   member can actually be used as a local Callback setting for "per" GraphExecutorCodegen
+        //   instantiation during each TVM build-codegen flow
+        external_json_writer_ = std::make_shared<ExternalJsonWriterCB>();
+        ICHECK(external_json_writer_);
+      }
+      static ExternalJsonWriterCB* GetExternalJsonWriter() { return external_json_writer_.get(); }
+      ....
+      LoweredOutput Codegen(IRModule mod, relay::Function func, String mod_name) {
+        ....
+
+        // if it has been registered for this GraphExecutorCodegen object, call the external JSON writer
+        if (external_json_writer_->HasCallback()) {
+          std::ostringstream external_os;
+          dmlc::JSONWriter external_writer(&external_os);
+          external_json_writer_->Exe(&external_writer, ret.external_mods, nodes_, heads_);
+          ret.external_graph_json = external_os.str();
+        }
+
+        return ret;
+      }
+    };
+
+    extern "C" ExternalJsonWriterCB* GetExternalJsonWriter() {
+      return GraphExecutorCodegen::GetExternalJsonWriter();
+    }
+
+    /////////////
+    // in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+    // Marvell-specific extensions
+    class GraphInputNodeMrvlExt : public GraphInputNode {
+        ...
+        GraphNodeType Type() const override { return kGraphInputNodeExt; }
+        void Save(dmlc::JSONWriter* writer) const override { /* extensions */ }
+    }
+
+    class GraphOpNodeMrvlExt : public GraphOpNode {
+        ...
+        GraphNodeType Type() const override { return kGraphOpNodeExt; }
+        void Load(dmlc::JSONReader* reader) override;
+        void LoadAttrs(dmlc::JSONReader* reader);
+        std::pair<std::string, GraphAttrs> GetLoadedGraphAttrs();
+    }
+
+    class MrvlExtJson {
+     public:
+      MrvlExtJson() {
+        ICHECK(!GetExternalJsonWriter()->HasCallback()) << "ERROR: has registered callback";
+        GetExternalJsonWriter()->RegisterCB(this, &MrvlExtJson::GetExternalJSON);
+      }
+      virtual ~MrvlExtJson() {}
+      void GetExternalJSON(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                           std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads);
+      void LoadExternalJsonAttrs(std::unordered_map<std::string, GraphAttrs>* external_attrs_map,
+                                 const Array<tvm::runtime::Module>& external_mods);
+    };
+```
+
+* The need to link the pre-trained model to the final Marvell backend layers - for instance, through tvm_custom:
+    * We did not include prototype code in PR-9730 but intend to provide our sample changes in another RFC and PR.
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* We haven't identified any major "do-not-do" items. Several other design decisions are deliberate choices - that is,
+  we understand that there are benefits both to doing and to not doing them.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.
+
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+* We are following the existing TVM BYOC framework and example files.
+    * For example: to do IR compositions, to define our own IR passes, to mix implementations in Python/C++, etc.
+
+* We have extended graph_executor_codegen.cc and the JSON loader/saver in order to read and write out Marvell-specific
+  attributes
+
+* Currently, we haven't spent enough time to understand the tvm/rust/cargo requirements and steps. Therefore, we are
+  bypassing the tvm/Jenkinsfile's tests/scripts/task_rust.sh step. We will need help to re-enable that step.
+
+* We would like to duplicate the Jenkins environment in order to run tvm/Jenkinsfile as-is, but we ran into many issues.
+  Currently, we have a TVM-like Jenkins environment that runs only a subset of test suites using a modified
+  Jenkinsfile.
+
+* We have identified a need to allow a callback function to be registered when generating the Mrvl-BYOC-specific
+  Nodes-JSON file. We are trying to follow the TVM Python/C++ callback style as much as possible. But since our
+  tvm/src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc::GetExternalJSON() callback function uses
+  non-simple argument types, we need suggestions/guidelines from the TVM community in order to make the new
+  callback code better meet TVM community requirements.
+
+* For one Mrvl-BYOC relay transformation pass, we have identified a need to inject a (global) expr node ID for the
+  RelayExprNode class and its derived classes, Tuple and CallNode, so that during the transformation pass we can
+  uniquely identify each Tuple or CallNode object. Again, we need help from the TVM community to provide
+  suggestions/guidelines here in order to know whether this is one of the best ways to achieve the Mrvl-BYOC need.
+
+* We also identified a need to maintain linkages between (operator-)information described in the original, given
+  pre-trained network model and the code-gen JSON files so that the compiler backend will be able to report user-level
+  (i.e., meaningful-to-user) messages regarding the given pre-trained network. For instance, in the
+  tvm/python/tvm/relay/frontend/onnx.py and common.py files, we can see user-level information being captured using
+  “tvm_custom”-related code in the original onnx.py file for the given pre-trained network; but in common.py, the code
+  later drops the linkage, via attrs.pop(“tvm_custom”), and does not pass it on to the initial relay IR graph.
+  We have a draft solution to maintain linkages between the given pre-trained network model and its relay IR graph
+  (using an expr node ID and a tvm_custom ID, plus a few utility functions), but would like to know whether the TVM
+  community has any better or work-in-progress resolution.
+
+* When using TVM RPC code to exercise and run inference on a remote-hosted Mrvl ML/AI HW accelerator for the Mrvl
+  subgraph, we ran into one minor issue and have made local TVM RPC enhancement so that, when a TVM RPC client sends

Review comment:
       could you explain the nature of the problem that requires the client to know the absolute path?

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; it can be used to read
+  in the Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate the final
+  instructions in a model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in this upstream contribution
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX format (say, mnist_fashion.onnx); a training/export sketch is given after the
+  snippet below. Then, we can run the BYOC Marvell flow by giving any image of the orig_test_images[i] dataset
+  to get its inference fashion label and item name in top_label_id and fashion_label_dictionary[top_label_id],
+  respectively. In addition, we can also use the corresponding golden label, golden_output_labels[i], to validate
+  the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
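+
+One possible way to train this model and export it to mnist_fashion.onnx is sketched below. This is an assumption,
+  not the exact script behind this RFC: it uses TensorFlow/Keras plus the tf2onnx package, padding="same" on the
+  Conv2D layers (to match the tensor shapes shown later), a Permute layer to accept the NCHW input, and Flatten()
+  in place of the Reshape() shown in the pseudo code:
+
+```
+import tensorflow as tf
+from tensorflow import keras
+import tf2onnx
+
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+
+# normalize and reshape to the NCHW input shape [N, 1, 28, 28] used in this example
+x_train = train_images.astype("float32").reshape(-1, 1, 28, 28) / 255.0
+
+model = keras.Sequential(
+    [
+        keras.Input(shape=(1, 28, 28)),
+        keras.layers.Permute((2, 3, 1)),  # NCHW -> NHWC for the Keras conv/pool layers
+        keras.layers.Conv2D(64, kernel_size=(2, 2), padding="same", activation="relu"),
+        keras.layers.MaxPooling2D(pool_size=(2, 2)),
+        keras.layers.Conv2D(32, kernel_size=(2, 2), padding="same", activation="relu"),
+        keras.layers.MaxPooling2D(pool_size=(2, 2)),
+        keras.layers.Dropout(0.3),
+        keras.layers.Flatten(),
+        keras.layers.Dense(256, activation="relu"),
+        keras.layers.Dense(10),
+    ]
+)
+model.compile(
+    optimizer="adam",
+    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+    metrics=["accuracy"],
+)
+model.fit(x_train, train_labels, epochs=1, batch_size=64)
+
+# export the trained Keras model to ONNX
+spec = (tf.TensorSpec((1, 1, 28, 28), tf.float32, name="permute_input"),)
+tf2onnx.convert.from_keras(model, input_signature=spec, output_path="mnist_fashion.onnx")
+```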
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in the pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate the TVM mod and params.
+  Then, they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the
+  Nodes-JSON file (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
+
+* Note: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in the Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return the Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod=mod,
+        params=params,
+    )
+
+
+    # from infrastructure.py: pseudo code for the above aot_build_and_json_codegen() function
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such alternative (i.e., the default strategy), under which,
+  for this specific sample MNIST model, the entire network model is turned into one Mrvl subgraph and
+  no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after the from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get to the following one Mrvl subgraph by applying the default strategy.
+    * In the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for the layout-converted, composited IR graph

Review comment:
       could you clarify this sentence?

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+* We can get to the following one Mrvl subgraph by applying the default strategy.
+    * In the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for the layout-converted, composited IR graph
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer: tvmgen_mrvl_main_1, conceptually speaking.
+```
+      # from original IR graphs

Review comment:
       this process looks rather similar to the device planning pass used in `tvm.relay.build`. are they the same? if not, could you motivate why you don't want to reuse that one?

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;

Review comment:
       above, the RFC discusses having exactly one Marvell and non-Marvell subgraph, but here I see 8 different function calls. do you mean that there are two targets, and you partition the graph into 8 subgraphs, but each subgraph is assigned to one or the other target? (reading further, I can see this is not the case, but it would help with reader comprehension to clarify this example)

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
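+
+The Composite="mrvl.conv2d_nhwc2nhwc" function shown above is produced by pattern-based merging. A rough sketch of
+  how such a fusing pattern could be declared with TVM's dataflow-pattern API is given below; this is an illustration
+  only, and the actual mrvl.mrvl_pattern_table() implementation may differ:
+
+```
+from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard
+from tvm.relay.op.contrib.register import register_pattern_table
+
+
+def conv2d_nhwc2nhwc_pattern():
+    """Match nn.conv2d + add (bias) + nn.relu, i.e., PartitionedFromPattern="nn.conv2d_add_nn.relu_"."""
+    conv = is_op("nn.conv2d")(wildcard(), is_constant())
+    bias = is_op("add")(conv, is_constant())
+    return is_op("nn.relu")(bias)
+
+
+@register_pattern_table("mrvl")
+def mrvl_pattern_table():
+    """Register the composite patterns used to form Marvell backend layers."""
+    return [("mrvl.conv2d_nhwc2nhwc", conv2d_nhwc2nhwc_pattern())]
+```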
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for the Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer, tvmgen_mrvl_main_0,
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor. A rough end-to-end pass-pipeline sketch is given after the
+    snippet below.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
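+
+Putting the layout conversion, composite merging, and partitioning together, the core of partition_for_mrvl() can be
+  thought of as a BYOC-style pass pipeline roughly like the sketch below (using the mrvl_pattern_table() sketched
+  earlier). This is an approximation for illustration only; the actual pass list and options in mrvl.py may differ:
+
+```
+import tvm
+from tvm import relay
+
+
+def partition_for_mrvl_sketch(mod, params=None):
+    """Approximate flow: convert layouts, merge composites, then annotate and partition for "mrvl"."""
+    if params:
+        mod["main"] = relay.build_module.bind_params_by_name(mod["main"], params)
+    seq = tvm.transform.Sequential(
+        [
+            relay.transform.InferType(),
+            relay.transform.ConvertLayout(
+                {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+            ),
+            relay.transform.MergeComposite(mrvl_pattern_table()),
+            relay.transform.AnnotateTarget("mrvl"),
+            relay.transform.MergeCompilerRegions(),
+            relay.transform.PartitionGraph(),
+        ]
+    )
+    with tvm.transform.PassContext(opt_level=3):
+        return seq(mod)
+```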
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute
+  stored for the class CallNode and the class Tuple (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can contain composited Marvell layer(s), back to its original IR nodes (e.g., using the original tensor
+      layout and with no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph that is recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using original layouts with no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """Revert the Mrvl subgraph back to an IR subgraph with original layouts and no compositions."""
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl_subgraph.functions, mod_mrvl_subgraph.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(
+        mrvl_layers_in_mrvl_subgraph, mod_mrvl_subgraph
+    ).visit(mod_mrvl_subgraph["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(
+        mod_new,
+        relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}),
+    )
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:

Review comment:
       could you motivate this further? it's hard to understand why you need to output your own JSON format without some explanation here.

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; it can be used to read
+  in the Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final
+  instructions in a model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in the upstream contribution
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI
+  tarball; it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: the Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC,
+  we will focus only on float16 inference runs
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
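+
+For completeness, a rough training/export sketch is shown below. It is not part of the BYOC Marvell flow itself;
+it assumes tensorflow and tf2onnx are available, uses Keras' default NHWC input layout (the pseudo code above shows
+an NCHW shape), replaces the pseudo-code Reshape() with Flatten(), and the hyper-parameters are illustrative only:
+
+```
+import tensorflow as tf
+from tensorflow import keras
+import tf2onnx
+
+(train_images, train_labels), _ = keras.datasets.fashion_mnist.load_data()
+train_images = train_images.astype("float32") / 255.0
+train_images = train_images.reshape(-1, 28, 28, 1)  # channels-last for Keras
+
+model = keras.Sequential(
+    [
+        keras.Input(shape=(28, 28, 1)),
+        keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu"),
+        keras.layers.MaxPooling2D(pool_size=(2, 2)),
+        keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu"),
+        keras.layers.MaxPooling2D(pool_size=(2, 2)),
+        keras.layers.Dropout(0.3),
+        keras.layers.Flatten(),
+        keras.layers.Dense(256, activation="relu"),
+        keras.layers.Dense(10),
+    ]
+)
+model.compile(
+    optimizer="adam",
+    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+    metrics=["accuracy"],
+)
+model.fit(train_images, train_labels, epochs=1, batch_size=64)
+
+# Export the trained model to ONNX using the file name from the example above.
+spec = (tf.TensorSpec((None, 28, 28, 1), tf.float32, name="input"),)
+tf2onnx.convert.from_keras(model, input_signature=spec, output_path="mnist_fashion.onnx")
+```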
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in the pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate the TVM mod and params.
+  Then, they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the
+  Nodes-JSON file (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod=mod,
+        params=params,
+    )
+
+
+    # from infrastructure.py: pseudo code defined by the above aot_build_and_json_codegen() function
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
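+
+To close the loop on the golden-label validation mentioned earlier, a check could look like the sketch below.
+Here run_mrvl_inference() is a hypothetical helper (not part of PR-9730) that wraps STEP (2) and STEP (3a)/(3b)
+and returns the [1, 10] output tensor for one input image:
+
+```
+import numpy as np
+
+# run_mrvl_inference() is a hypothetical wrapper around the Mrvl backend compiler and simulator/HW run.
+output = run_mrvl_inference(nodes_json_filename, consts_json_filename, orig_test_images[0])
+assert np.argmax(output) == golden_output_labels[0]
+```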
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such strategy (the default strategy), under which, for this
+  specific sample MNIST model, the entire network model is turned into one Mrvl subgraph and no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We arrive at the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers (see the pattern sketch after the snippet below).
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited, conceptually speaking, into one Marvell layer: tvmgen_mrvl_main_3.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
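+
+A minimal sketch of how such a fusing pattern could be declared with the TVM dataflow-pattern helpers is shown
+below; the actual mrvl_pattern_table() in mrvl.py may differ, and only the composite name is taken from the
+snippet above:
+
+```
+from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard
+
+
+def make_conv2d_pattern():
+    """Match nn.conv2d + add + nn.relu, i.e. the mrvl.conv2d_nhwc2nhwc composite shown above."""
+    conv = is_op("nn.conv2d")(wildcard(), is_constant())
+    bias = is_op("add")(conv, is_constant())
+    return is_op("nn.relu")(bias)
+
+
+def conv2d_pattern_table():
+    # (composite name, pattern); a check predicate can be appended as a third tuple element.
+    return [("mrvl.conv2d_nhwc2nhwc", make_conv2d_pattern())]
+```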
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer, tvmgen_mrvl_main_0,
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute
+  stored for the class CallNode and the class Tuple (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can contain composited Marvell layer(s), back to its original IR nodes (e.g., to use the original tensor
+      layout and with no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph that is recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using the original layouts with no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl_subgraph.functions, mod_mrvl_subgraph.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(mrvl_layers_in_mrvl_subgraph, mod_mrvl_subgraph).visit(mod_mrvl_subgraph["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}))
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:
+    * Some common classes have been moved from the original src/relay/backend/graph_executor_codegen.cc file to the
+      new src/relay/backend/graph_executor_codegen.h file so that they can be shared by Marvell-specific functions
+      and derived classes defined in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+
+    * new definitions are listed below:
+```
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.h file
+    /*! \brief Node types */
+    enum GraphNodeType {
+      kGraphNop,
+      kGraphInputNode,
+      kGraphOpNode,
+      kGraphInputNodeExt,
+      kGraphOpNodeExt,
+    };
+
+    
+    class ExternalJsonWriterCB {
+     public:
+      template <class T>
+      void RegisterCB(T* const object,
+                      void (T::*const mf)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                          std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        using namespace std::placeholders;
+        callback_ = std::bind(mf, object, _1, _2, _3, _4);
+        hasCallback_ = true;
+      }
+      void RegisterCB(void (*const fun)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                        std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        callback_ = fun;
+        hasCallback_ = true;
+      }
+      void Exe(dmlc::JSONWriter* external_writer, Array<tvm::runtime::Module> mod,
+               std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+        ICHECK(hasCallback_) << "ERROR: no registered callback";
+        callback_(external_writer, mod, nodes, heads);
+      }
+      inline bool HasCallback() { return hasCallback_; }
+
+     private:
+      std::function<void(dmlc::JSONWriter*, Array<tvm::runtime::Module>, std::vector<GraphObjectPtr>,
+                         std::vector<GraphNodeRef>)>
+          callback_;
+      bool hasCallback_{false};
+    };
+
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.cc file
+    class GraphExecutorCodegen : public backend::MemoizedExprTranslator<std::vector<GraphNodeRef>> {
+     public:
+      GraphExecutorCodegen(runtime::Module* mod, const TargetMap& targets)
+          : mod_(mod), targets_(targets) {
+        // we need the following variable to be a static member of the class so we can access
+        //   its setting in the following static GetExternalJsonWriter() function; but this static
+        //   member can actually be used as a local Callback setting for "per" GraphExecutorCodegen
+        //   instantiation during each TVM build-codegen flow
+        external_json_writer_ = std::make_shared<ExternalJsonWriterCB>();
+        ICHECK(external_json_writer_);
+      }
+      static ExternalJsonWriterCB* GetExternalJsonWriter() { return external_json_writer_.get(); }
+      ....
+      LoweredOutput Codegen(IRModule mod, relay::Function func, String mod_name) {
+        ....
+
+        // if it has been registered for this GraphExecutorCodegen object, call the external JSON writer
+        if (external_json_writer_->HasCallback()) {
+          std::ostringstream external_os;
+          dmlc::JSONWriter external_writer(&external_os);
+          external_json_writer_->Exe(&external_writer, ret.external_mods, nodes_, heads_);
+          ret.external_graph_json = external_os.str();
+        }
+
+        return ret;
+      }
+    };
+
+    extern "C" ExternalJsonWriterCB* GetExternalJsonWriter() {
+      return GraphExecutorCodegen::GetExternalJsonWriter();
+    }
+
+    /////////////
+    // in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+    // Marvell-specific extensions
+    class GraphInputNodeMrvlExt : public GraphInputNode {
+        ...
+        GraphNodeType Type() const override { return kGraphInputNodeExt; }
+        void Save(dmlc::JSONWriter* writer) const override { /* extensions */ }
+    }
+
+    class GraphOpNodeMrvlExt : public GraphOpNode {
+        ...
+        GraphNodeType Type() const override { return kGraphOpNodeExt; }
+        void Load(dmlc::JSONReader* reader) override;
+        void LoadAttrs(dmlc::JSONReader* reader);
+        std::pair<std::string, GraphAttrs> GetLoadedGraphAttrs();
+    }
+
+    class MrvlExtJson {
+     public:
+      MrvlExtJson() {
+        ICHECK(!GetExternalJsonWriter()->HasCallback()) << "ERROR: has registered callback";
+        GetExternalJsonWriter()->RegisterCB(this, &MrvlExtJson::GetExternalJSON);
+      }
+      virtual ~MrvlExtJson() {}
+      void GetExternalJSON(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                           std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads);
+      void LoadExternalJsonAttrs(std::unordered_map<std::string, GraphAttrs>* external_attrs_map,
+                                 const Array<tvm::runtime::Module>& external_mods);
+    };
+```
+
+* There is also a need to link between the pre-trained model and the final Marvell backend layers - for instance, through tvm_custom.
+    * We did not include prototype code in PR-9730 but intend to provide our sample changes in another RFC and PR.
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* We haven't identified any major items that should *not* be done. Several other designs are deliberate choices -
+  that is, we understand that there are benefits both for doing and for not doing them.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.
+
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+* We are following the existing TVM BYOC framework and example files.
+    * for example: to do IR compositions, to define our own IR passes, to mix implementations in Python/C++, etc.
+
+* We have extended graph_executor_codegen.cc and JSON loader/saver in order to read and write out Marvell specific
+  attributes
+
+* Currently, we haven't spent enough time to understand the tvm/rust/cargo requirements and steps. Therefore, we are
+  bypassing the tvm/Jenkinsfile's tests/scripts/task_rust.sh step. We will need help to re-enable that step.
+
+* We would like to duplicate the Jenkins environment in order to run tvm/Jenkinsfile as is, but we ran into many
+  issues. Currently, we have a TVM-like Jenkins environment that runs only a subset of test suites using a modified
+  Jenkinsfile.
+
+* We have identified a need to allow a call-back function to be registered when generating the Mrvl-BYOC-specific
+  Nodes-JSON file. We are trying to follow the TVM Python/CPP callback style as much as possible. But, since our
+  tvm/src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc::GetExternalJSON() callback function uses
+  non-simple argument types, we need help from the TVM community with suggestions/guidelines so that the new
+  callback code better meets TVM community requirements here.
+
+* For one Mrvl-BYOC relay transformation pass, we have identified a need to inject a (global) expr node ID for the
+  RelayExprNode class and its derived classes, Tuple and CallNode, so that during the transformation pass we can
+  uniquely identify each Tuple or CallNode object (a Python-side sketch of the idea follows below). Again, we need
+  help from the TVM community with suggestions/guidelines on whether this is one of the best ways to achieve the
+  Mrvl-BYOC need.
+
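+A Python-side sketch of the idea (only an illustration of uniquely identifying Call and Tuple nodes; it is not
+the en_id mechanism used in PR-9730) could be:
+
+```
+from tvm.relay.expr_functor import ExprVisitor
+
+
+class AssignNodeIds(ExprVisitor):
+    """Assign a unique, deterministic ID to every Call and Tuple node of a function."""
+
+    def __init__(self):
+        super().__init__()
+        self.node_ids = {}
+        self._next_id = 0
+
+    def visit_call(self, call):
+        self.node_ids[call] = self._next_id
+        self._next_id += 1
+        super().visit_call(call)
+
+    def visit_tuple(self, tup):
+        self.node_ids[tup] = self._next_id
+        self._next_id += 1
+        super().visit_tuple(tup)
+
+
+# usage: visitor = AssignNodeIds(); visitor.visit(mod["main"]); visitor.node_ids maps each node to its ID
+```
+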
+* We also identified a need to maintain linkages between (operator-)information described in the original, given
+  pre-trained network model and the code-gen JSON files so that the compiler backend will be able to report user-level
+  (e.g., meaningful-to-user) messages regarding the given pre-trained network. For instance, in the
+  tvm/python/tvm/relay/frontend/onnx.py and common.py files, we can see user-level information being captured using
+  “tvm_custom” related code as in original onnx.py file for the given pre-trained network; but, in common.py, the code
+  later drops the linkage, via attrs.pop(“tvm_custom”), and does not pass the linkage onto the initial relay IR graph.
+  We have a draft solution to maintain linkages between the given pre-trained network model and its relay IR graph
+  (using expr node ID and tvm custom ID, plus, a few utility functions), but would like to know whether the TVM
+  community has any better or work-in-progress resolution.
+
+* When using TVM RPC code to exercise and run inference on a remote-hosted Mrvl ML/AI HW accelerator for the Mrvl
+  subgraph, we ran into one minor issue and have made a local TVM RPC enhancement so that, when a TVM RPC client sends
+  a file to the remote server, the TVM RPC client can know where the remote server saves the file on the remote machine.
+  Since this is not directly related to this Mrvl-BYOC PR, we will find time to contribute this enhancement back in
+  another TVM PR soon.
+
+* In order for us to generate the constants-JSON file, we must “NOT” remove external params, which were stored in

Review comment:
       why is this? params passed in MetadataModule are meant for consumption only by the `runtime.Module` which defines them. it seems like perhaps you need to consume them at the executor level. could you explain that?

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; it can be used to read
+  in the Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final
+  instructions in a model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in the upstream contribution
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI
+  tarball; it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: the Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC,
+  we will focus only on float16 inference runs
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
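+
+Before bringing in the Mrvl flow, the same ONNX file can also be run through the plain TVM LLVM flow to produce a
+reference result to compare against the golden label. The sketch below is illustrative only; the input-tensor name
+is assumed from the IR dump further below and the pre-processing may differ from the real test:
+
+```
+import numpy as np
+import onnx
+import tvm
+from tvm import relay
+from tvm.contrib import graph_executor
+
+onnx_model = onnx.load("mnist_fashion.onnx")
+mod, params = relay.frontend.from_onnx(onnx_model, dtype="float32", freeze_params=False)
+
+with tvm.transform.PassContext(opt_level=3):
+    lib = relay.build(mod, target="llvm", params=params)
+
+runtime = graph_executor.GraphModule(lib["default"](tvm.cpu(0)))
+image = orig_test_images[0].astype("float32").reshape(1, 1, 28, 28)  # pre-processing assumed
+runtime.set_input("permute_input", image)  # input name assumed from the IR dump below
+runtime.run()
+ref_out = runtime.get_output(0).numpy()
+print("reference label:", np.argmax(ref_out), "golden:", golden_output_labels[0])
+```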
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in the pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate the TVM mod and params.
+  Then, they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the
+  Nodes-JSON file (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod=mod,
+        params=params,
+    )
+
+
+    # from infrastructure.py: pseudo code defined by the above aot_build_and_json_codegen() function
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such strategy (the default strategy), under which, for this
+  specific sample MNIST model, the entire network model is turned into one Mrvl subgraph and no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We arrive at the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers (see the check-predicate sketch after the snippet below).
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited, conceptually speaking, into one Marvell layer: tvmgen_mrvl_main_3.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
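+
+For illustration, a per-pattern check predicate could be attached to such a composite using the three-element
+pattern-table form; the predicate below is only a sketch (it assumes type inference has already run) and may
+differ from the real checks in mrvl.py:
+
+```
+from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard
+
+
+def make_conv2d_pattern():
+    conv = is_op("nn.conv2d")(wildcard(), is_constant())
+    bias = is_op("add")(conv, is_constant())
+    return is_op("nn.relu")(bias)
+
+
+def check_conv2d(extract):
+    """Offload only float32 convolutions; 'extract' is the matched composite body (the outer nn.relu call)."""
+    call = extract
+    while call.op.name != "nn.conv2d":
+        call = call.args[0]
+    return call.checked_type.dtype == "float32"
+
+
+def conv2d_pattern_table_with_check():
+    return [("mrvl.conv2d_nhwc2nhwc", make_conv2d_pattern(), check_conv2d)]
+```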
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer, tvmgen_mrvl_main_0,
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute
+  stored for the class CallNode and the class Tuple (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can contain composited Marvell layer(s), back to its original IR nodes (e.g., to use the original tensor
+      layout and with no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph that is recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using the original layouts with no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl_subgraph.functions, mod_mrvl_subgraph.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(mrvl_layers_in_mrvl_subgraph, mod_mrvl_subgraph).visit(mod_mrvl_subgraph["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}))
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:
+    * Some common classes have been moved from the original src/relay/backend/graph_executor_codegen.cc file to the
+      new src/relay/backend/graph_executor_codegen.h file so that they can be shared by Marvell-specific functions
+      and derived classes defined in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+
+    * new definitions are listed below:
+```
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.h file
+    /*! \brief Node types */
+    enum GraphNodeType {
+      kGraphNop,
+      kGraphInputNode,
+      kGraphOpNode,
+      kGraphInputNodeExt,
+      kGraphOpNodeExt,
+    };
+
+    
+    class ExternalJsonWriterCB {
+     public:
+      template <class T>
+      void RegisterCB(T* const object,
+                      void (T::*const mf)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                          std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        using namespace std::placeholders;
+        callback_ = std::bind(mf, object, _1, _2, _3, _4);
+        hasCallback_ = true;
+      }
+      void RegisterCB(void (*const fun)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                        std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        callback_ = fun;
+        hasCallback_ = true;
+      }
+      void Exe(dmlc::JSONWriter* external_writer, Array<tvm::runtime::Module> mod,
+               std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+        ICHECK(hasCallback_) << "ERROR: no registered callback";
+        callback_(external_writer, mod, nodes, heads);
+      }
+      inline bool HasCallback() { return hasCallback_; }
+
+     private:
+      std::function<void(dmlc::JSONWriter*, Array<tvm::runtime::Module>, std::vector<GraphObjectPtr>,
+                         std::vector<GraphNodeRef>)>
+          callback_;
+      bool hasCallback_{false};
+    };
+
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.cc file
+    class GraphExecutorCodegen : public backend::MemoizedExprTranslator<std::vector<GraphNodeRef>> {
+     public:
+      GraphExecutorCodegen(runtime::Module* mod, const TargetMap& targets)
+          : mod_(mod), targets_(targets) {
+        // we need the following variable to be a static member of the class so we can access
+        //   its setting in the following static GetExternalJsonWriter() function; but this static
+        //   member can actually be used as a local Callback setting for "per" GraphExecutorCodegen
+        //   instantiation during each TVM build-codegen flow
+        external_json_writer_ = std::make_shared<ExternalJsonWriterCB>();
+        ICHECK(external_json_writer_);
+      }
+      static ExternalJsonWriterCB* GetExternalJsonWriter() { return external_json_writer_.get(); }
+      ....
+      LoweredOutput Codegen(IRModule mod, relay::Function func, String mod_name) {
+        ....
+
+        // if it has been registered for this GraphExecutorCodegen object, call the external JSON writer
+        if (external_json_writer_->HasCallback()) {
+          std::ostringstream external_os;
+          dmlc::JSONWriter external_writer(&external_os);
+          external_json_writer_->Exe(&external_writer, ret.external_mods, nodes_, heads_);
+          ret.external_graph_json = external_os.str();
+        }
+
+        return ret;
+      }
+    };
+
+    extern "C" ExternalJsonWriterCB* GetExternalJsonWriter() {
+      return GraphExecutorCodegen::GetExternalJsonWriter();
+    }
+
+    /////////////
+    // in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+    // Marvell-specific extensions
+    class GraphInputNodeMrvlExt : public GraphInputNode {
+        ...
+        GraphNodeType Type() const override { return kGraphInputNodeExt; }
+        void Save(dmlc::JSONWriter* writer) const override { /* extensions */ }
+    }
+
+    class GraphOpNodeMrvlExt : public GraphOpNode {
+        ...
+        GraphNodeType Type() const override { return kGraphOpNodeExt; }
+        void Load(dmlc::JSONReader* reader) override;
+        void LoadAttrs(dmlc::JSONReader* reader);
+        std::pair<std::string, GraphAttrs> GetLoadedGraphAttrs();
+    }
+
+    class MrvlExtJson {
+     public:
+      MrvlExtJson() {
+        ICHECK(!GetExternalJsonWriter()->HasCallback()) << "ERROR: has registered callback";
+        GetExternalJsonWriter()->RegisterCB(this, &MrvlExtJson::GetExternalJSON);
+      }
+      virtual ~MrvlExtJson() {}
+      void GetExternalJSON(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                           std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads);
+      void LoadExternalJsonAttrs(std::unordered_map<std::string, GraphAttrs>* external_attrs_map,
+                                 const Array<tvm::runtime::Module>& external_mods);
+    };
+```
+
+* There is also a need to link between the pre-trained model and the final Marvell backend layers - for instance, through tvm_custom.
+    * We did not include prototype code in PR-9730 but intend to provide our sample changes in another RFC and PR.
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* We haven't identified any major items that should *not* be done. Several other designs are deliberate choices -
+  that is, we understand that there are benefits both for doing and for not doing them.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.
+
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+* We are following the existing TVM BYOC framework and example files.
+    * for example: to do IR compositions, to define our own IR passes, to mix implementations in Python/C++, etc.
+
+* We have extended graph_executor_codegen.cc and JSON loader/saver in order to read and write out Marvell specific
+  attributes
+
+* Currently, we haven't spent enough time to understand the tvm/rust/cargo requirements and steps. Therefore, we are
+  bypassing the tvm/Jenkinsfile's tests/scripts/task_rust.sh step. We will need help to re-enable that step.
+
+* We would like to duplicate the Jenkins environment in order to run tvm/Jenkinsfile as is, but we ran into many
+  issues. Currently, we have a TVM-like Jenkins environment that runs only a subset of test suites using a modified
+  Jenkinsfile.
+
+* We have identified a need to allow a call-back function to be registered when generating the Mrvl-BYOC-specific
+  Nodes-JSON file. We are trying to follow the TVM Python/CPP callback style as much as possible. But, since our
+  tvm/src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc::GetExternalJSON() callback function uses
+  non-simple argument types, we need help from the TVM community with suggestions/guidelines so that the new
+  callback code better meets TVM community requirements here.
+
+* For one Mrvl-BYOC relay transformation pass, we have identified a need to inject a (global) expr node ID for the
+  RelayExprNode class and its derived classes, Tuple and CallNode, so that during the transformation pass we can
+  uniquely identify each Tuple or CallNode object. Again, we need help from the TVM community with
+  suggestions/guidelines on whether this is one of the best ways to achieve the Mrvl-BYOC need.

Review comment:
       i think it would help to spell out why you guys need to be able to identify each expression here.

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; it can be used to read
+  in the Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final
+  instructions in a model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in the upstream contribution
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI
+  tarball; it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: the Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC,
+  we will focus only on float16 inference runs
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between the sub-graph(s) in order to run inference for the given pre-trained network.
+  Note: the runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular
+  TVM LLVM build (a minimal sketch is shown after this list)
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
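+
+As a minimal sketch (assuming mod_non_mrvl_subgraph is an ordinary Relay module, params holds its constants, and
+  input_data stands for the tensor handed over from the Mrvl subgraph; this is illustrative, not the upstreamed
+  integration code), the for-TVM-target non-Mrvl subgraph can be built and run with the regular TVM LLVM flow:
+
+```
+import tvm
+from tvm import relay
+from tvm.contrib import graph_executor
+
+
+def run_non_mrvl_subgraph(mod_non_mrvl_subgraph, params, input_data):
+    # build the for-TVM-target (non-Mrvl) subgraph with the regular TVM LLVM build
+    lib = relay.build(mod_non_mrvl_subgraph, target="llvm", params=params)
+    dev = tvm.cpu(0)
+    rt_mod = graph_executor.GraphModule(lib["default"](dev))
+    # feed the tensor produced by the Mrvl subgraph (or the original network input)
+    rt_mod.set_input(0, input_data)
+    rt_mod.run()
+    return rt_mod.get_output(0).numpy()
+```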
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX format (say, mnist_fashion.onnx). Then, we can run the BYOC Marvell flow on any
+  image orig_test_images[i] of the test dataset to get its inferred fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result (see the small validation sketch after
+  the code below).
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
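+
+As a minimal sketch (run_mrvl_inference below is a hypothetical helper standing in for STEP (3a)/(3b); it is not part
+  of the upstreamed code), one inference result can be validated against its golden label as follows:
+
+```
+import numpy
+
+i = 0  # index of the test image to validate
+output_tensor = run_mrvl_inference(orig_test_images[i])  # hypothetical helper; output_shape: [1, 10]
+top_label_id = int(numpy.argmax(output_tensor))
+assert top_label_id == golden_output_labels[i]
+print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```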
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files, as well as
+  in the pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate the TVM mod and params.
+  Then, they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the
+  Nodes-JSON file (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in the Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return the Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively (a small inspection sketch follows the code below).
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        mod,
+        params,
+        model_name="mnist_fashion",
+        working_dir="mnist",
+    )
+
+
+    # from infrastructure.py: pseudo code defined by the above aot_build_and_json_codegen() function
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
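+
+As a small usage sketch (the top-level "nodes" key below is illustrative only; the exact schema is defined by the
+  Marvell writer in graph_executor_codegen_mrvl.cc), the generated meta-data files can be inspected like any other
+  JSON files:
+
+```
+import json
+
+with open(nodes_json_filename) as f:
+    mrvl_nodes_json = json.load(f)
+with open(consts_json_filename) as f:
+    mrvl_consts_json = json.load(f)
+
+# e.g., count the Marvell layer entries, assuming a top-level "nodes" list
+print(len(mrvl_nodes_json.get("nodes", [])), "Marvell layer entries")
+```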
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are the IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such strategy (the default strategy), under which,
+  for this specific sample MNIST model, the entire network model is turned into one Mrvl subgraph and
+  no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get to the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for the given IR graph:
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet optimized" Marvell (backend) layers. For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * The mrvl.mrvl_pattern_table() function defines the fusing patterns used to composite
+      original IR nodes into Marvell backend layers (a hedged sketch of such a pattern table follows the example below).
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer, tvmgen_mrvl_main_3, conceptually speaking.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
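+
+A minimal sketch, assuming the composite name "mrvl.conv2d_nhwc2nhwc" shown above and the standard Relay
+  dataflow-pattern APIs (the exact patterns and predicates live in mrvl.py and may differ), of how such a fusing
+  pattern could be declared in a pattern table:
+
+```
+from tvm.relay.dataflow_pattern import is_op, wildcard
+from tvm.relay.op.contrib.register import register_pattern_table
+
+
+def make_conv2d_nhwc2nhwc_pattern():
+    # nn.conv2d followed by add (bias) followed by nn.relu
+    conv = is_op("nn.conv2d")(wildcard(), wildcard())
+    bias = is_op("add")(conv, wildcard())
+    return is_op("nn.relu")(bias)
+
+
+@register_pattern_table("mrvl")
+def mrvl_pattern_table():
+    return [("mrvl.conv2d_nhwc2nhwc", make_conv2d_nhwc2nhwc_pattern())]
+```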
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer, tvmgen_mrvl_main_0,
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we utilize the unique en_id attribute
+  stored for the class CallNode and the class Tuple (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can have composited Marvell layer(s), back to its original IR nodes (e.g., to use the original tensor
+      layout and no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph that is recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using the original layouts with no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """Revert the Mrvl subgraph back to an IR module that uses the original layouts
+    and contains no Marvell-specific compositions."""
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl_subgraph.functions, mod_mrvl_subgraph.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(
+        mrvl_layers_in_mrvl_subgraph, mod_mrvl_subgraph
+    ).visit(mod_mrvl_subgraph["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}))
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:
+    * Some common classes have been moved from the original src/relay/backend/graph_executor_codegen.cc file to the
+      new src/relay/backend/graph_executor_codegen.h file so that they can be shared by the Marvell-specific functions
+      and derived classes defined in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+
+    * new definitions are listed below:
+```
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.h file
+    /*! \brief Node types */
+    enum GraphNodeType {
+      kGraphNop,
+      kGraphInputNode,
+      kGraphOpNode,
+      kGraphInputNodeExt,
+      kGraphOpNodeExt,
+    };
+
+    
+    class ExternalJsonWriterCB {
+     public:
+      template <class T>
+      void RegisterCB(T* const object,
+                      void (T::*const mf)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                          std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        using namespace std::placeholders;
+        callback_ = std::bind(mf, object, _1, _2, _3, _4);
+        hasCallback_ = true;
+      }
+      void RegisterCB(void (*const fun)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                        std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        callback_ = fun;
+        hasCallback_ = true;
+      }
+      void Exe(dmlc::JSONWriter* external_writer, Array<tvm::runtime::Module> mod,
+               std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+        ICHECK(hasCallback_) << "ERROR: no registered callback";
+        callback_(external_writer, mod, nodes, heads);
+      }
+      inline bool HasCallback() { return hasCallback_; }
+
+     private:
+      std::function<void(dmlc::JSONWriter*, Array<tvm::runtime::Module>, std::vector<GraphObjectPtr>,
+                         std::vector<GraphNodeRef>)>
+          callback_;
+      bool hasCallback_{false};
+    };
+
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.cc file
+    class GraphExecutorCodegen : public backend::MemoizedExprTranslator<std::vector<GraphNodeRef>> {
+     public:
+      GraphExecutorCodegen(runtime::Module* mod, const TargetMap& targets)
+          : mod_(mod), targets_(targets) {
+        // we need the following variable to be a static member of the class so we can access
+        //   its setting in the following static GetExternalJsonWriter() function; but this static
+        //   member can actually be used as a local Callback setting for "per" GraphExecutorCodegen
+        //   instantiation during each TVM build-codegen flow
+        external_json_writer_ = std::make_shared<ExternalJsonWriterCB>();
+        ICHECK(external_json_writer_);
+      }
+      static ExternalJsonWriterCB* GetExternalJsonWriter() { return external_json_writer_.get(); }
+      ....
+      LoweredOutput Codegen(IRModule mod, relay::Function func, String mod_name) {
+        ....
+
+        // if it has been registered for this GraphExecutorCodegen object, call the external JSON writer
+        if (external_json_writer_->HasCallback()) {
+          std::ostringstream external_os;
+          dmlc::JSONWriter external_writer(&external_os);
+          external_json_writer_->Exe(&external_writer, ret.external_mods, nodes_, heads_);
+          ret.external_graph_json = external_os.str();
+        }
+
+        return ret;
+      }
+    };
+
+    extern "C" ExternalJsonWriterCB* GetExternalJsonWriter() {
+      return GraphExecutorCodegen::GetExternalJsonWriter();
+    }
+
+    /////////////
+    // in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+    // Marvell-specific extensions
+    class GraphInputNodeMrvlExt : public GraphInputNode {
+        ...
+        GraphNodeType Type() const override { return kGraphInputNodeExt; }
+        void Save(dmlc::JSONWriter* writer) const override { /* extensions */ }
+    };
+
+    class GraphOpNodeMrvlExt : public GraphOpNode {
+        ...
+        GraphNodeType Type() const override { return kGraphOpNodeExt; }
+        void Load(dmlc::JSONReader* reader) override;
+        void LoadAttrs(dmlc::JSONReader* reader);
+        std::pair<std::string, GraphAttrs> GetLoadedGraphAttrs();
+    };
+
+    class MrvlExtJson {
+     public:
+      MrvlExtJson() {
+        ICHECK(!GetExternalJsonWriter()->HasCallback()) << "ERROR: has registered callback";
+        GetExternalJsonWriter()->RegisterCB(this, &MrvlExtJson::GetExternalJSON);
+      }
+      virtual ~MrvlExtJson() {}
+      void GetExternalJSON(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                           std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads);
+      void LoadExternalJsonAttrs(std::unordered_map<std::string, GraphAttrs>* external_attrs_map,
+                                 const Array<tvm::runtime::Module>& external_mods);
+    };
+```
+
+* The need to link between the pre-trained model and the final Marvell backend layers - for instance, through tvm_custom
+    * We did not include prototype code in PR-9730 but intend to provide our sample changes in another RFC and PR.
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* We haven't identified any major items that we should *not* do. Several other parts of the design are deliberate
+  choices - that is, we understand that there are benefits both to doing and to not doing them.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.

Review comment:
       It seems like there has been some impact to the GraphExecutor, and I think one point of confusion here is whether it was necessary to do that or whether you could have handled the additional runtime complexity inside a Marvell-specific `runtime.Module`. Could you explain a bit further here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787276842



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########

Review comment:
       We are talking about different definitions of “(sub-)graphs” here. In the TVM partition pass, TVM’s graph or sub-graph is a merge-composite IR function, which can contain a pre-defined pattern of original frontend operators. In the BYOC-Marvell RFC’s definition, a sub-graph is a connected graph of Marvell merge-composite functions. For instance, tvmgen_mrvl_main_4 (see below in the original email) is a TVM-partition sub-graph, which is a Marvell merge-composite function containing the frontend operators: conv, add, batchnorm, tuple-get-item, relu. But a Marvell sub-graph contains, in the given test case, several Marvell merge-composite functions.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787276340



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########

Review comment:
       We did not know how to include a figure in the RFC file, but we did include figures at the end of the corresponding pre-RFC on the discuss forum. Please check the end of the pre-RFC and its figures to see whether they help explain the definition of Marvell sub-graphs here: https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1022618855


   @areusch: Forgot to answer your first question.
   Yes, for now, we would like to generate external JSON files, only for the accelerator sub-graph(s), in the BuildResult step so that we can pass them to our external, accelerator-specific compiler backend program (not yet part of the TVM general flow) to do further AOT optimization for inference runs, in order to generate more optimized ISA code to run inference for the accelerator sub-graph(s).
   For the llvm sub-graphs (let me use llvm here to distinguish them from the for-accelerator sub-graphs), we will and do follow the TVM general flow to generate the .so lib files.
   When the Model Library Format and flow become stable, and once they can be specialized to include extra external-accelerator and memory allocations & communications, we would definitely like to see how we can advance the BYOC-Marvell flow to use them (together with the llvm .so lib files).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r797197725



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
+  in model binary file
+
+* Note: Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included
+  to upstream
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the train_images dataset loaded below and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx); a training-and-export sketch follows the snippet below.
+  Then, we can run the BYOC Marvell flow by giving any image of the orig_test_images[i] dataset to get its
+  inference fashion label and item name in top_label_id and fashion_label_dictionary[top_label_id], respectively.
+  In addition, we can also use the corresponding golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
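+
+A minimal training-and-export sketch for the step described above is shown next. The layer stack
+  follows the pseudo model, while the NHWC input layout, the training hyper-parameters, and the use of
+  the tf2onnx package are assumptions made only for illustration (the IR graphs later in this section
+  use an NCHW [1, 1, 28, 28] input).
+
+```
+import numpy as np
+import tensorflow as tf
+from tensorflow import keras
+import tf2onnx
+
+# assemble the model following the pseudo layer stack above
+model = keras.Sequential(
+    [
+        keras.Input(shape=(28, 28, 1)),
+        keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu"),
+        keras.layers.MaxPooling2D(pool_size=(2, 2)),
+        keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu"),
+        keras.layers.MaxPooling2D(pool_size=(2, 2)),
+        keras.layers.Dropout(0.3),
+        keras.layers.Flatten(),
+        keras.layers.Dense(256, activation="relu"),
+        keras.layers.Dense(10),
+    ]
+)
+model.compile(
+    optimizer="adam",
+    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+    metrics=["accuracy"],
+)
+model.fit(train_images[..., np.newaxis] / 255.0, train_labels, epochs=5)
+
+# export the trained model to ONNX as mnist_fashion.onnx
+spec = (tf.TensorSpec((1, 28, 28, 1), tf.float32, name="input"),)
+tf2onnx.convert.from_keras(model, input_signature=spec, output_path="mnist_fashion.onnx")
+```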
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the Nodes-JSON file
+  (nodes_json_filename) and Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod,
+        params,
+    )
+
+
+    # from infrastructure.py: pseudo code of the aot_build_and_json_codegen() function shown above
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such strategy (i.e., the default strategy), under which this
+  specific sample MNIST model is turned entirely into one Mrvl subgraph with no non-Mrvl subgraph.
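+
+For reference, selecting a different cut only requires changing the partition arguments; a sketch is
+  shown below (it reuses the partition_for_mrvl() signature from the earlier snippet; treating
+  gen_non_mrvl_subgraph=True as the knob that also emits a non-Mrvl subgraph is an assumption made for
+  illustration):
+
+```
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=True,  # also emit the non-Mrvl subgraph
+        flow_pass=1,
+    )
+```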
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get to the following one Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph, as shown below
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet-optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer, tvmgen_mrvl_main_3, conceptually speaking.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for the Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer, tvmgen_mrvl_main_0,
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute
+  stored for the class CallNode and the class Tuple (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can contain composited Marvell layer(s), back to its original IR nodes (e.g., using the original tensor
+      layout and no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph that is recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using original layouts with no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """Revert the given Mrvl subgraph back to an IR module using original layouts and no compositions."""
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl_subgraph.functions, mod_mrvl_subgraph.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(mrvl_layers_in_mrvl_subgraph, mod_mrvl_subgraph).visit(mod_mrvl_subgraph["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}))
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:
+    * Some common classes have been moved from the original src/relay/backend/graph_executor_codegen.cc file to the
+      new src/relay/backend/graph_executor_codegen.h file so that they can be shared by Marvell-specific functions
+      and derived classes defined in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+
+    * new definitions are listed below:
+```
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.h file
+    /*! \brief Node types */
+    enum GraphNodeType {
+      kGraphNop,
+      kGraphInputNode,
+      kGraphOpNode,
+      kGraphInputNodeExt,
+      kGraphOpNodeExt,
+    };
+
+    
+    class ExternalJsonWriterCB {
+     public:
+      template <class T>
+      void RegisterCB(T* const object,
+                      void (T::*const mf)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                          std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        using namespace std::placeholders;
+        callback_ = std::bind(mf, object, _1, _2, _3, _4);
+        hasCallback_ = true;
+      }
+      void RegisterCB(void (*const fun)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                        std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        callback_ = fun;
+        hasCallback_ = true;
+      }
+      void Exe(dmlc::JSONWriter* external_writer, Array<tvm::runtime::Module> mod,
+               std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+        ICHECK(hasCallback_) << "ERROR: no registered callback";
+        callback_(external_writer, mod, nodes, heads);
+      }
+      inline bool HasCallback() { return hasCallback_; }
+
+     private:
+      std::function<void(dmlc::JSONWriter*, Array<tvm::runtime::Module>, std::vector<GraphObjectPtr>,
+                         std::vector<GraphNodeRef>)>
+          callback_;
+      bool hasCallback_{false};
+    };
+
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.cc file
+    class GraphExecutorCodegen : public backend::MemoizedExprTranslator<std::vector<GraphNodeRef>> {
+     public:
+      GraphExecutorCodegen(runtime::Module* mod, const TargetMap& targets)
+          : mod_(mod), targets_(targets) {
+        // we need the following variable to be a static member of the class so we can access
+        //   its setting in the following static GetExternalJsonWriter() function; but this static
+        //   member can actually be used as a local Callback setting for "per" GraphExecutorCodegen
+        //   instantiation during each TVM build-codegen flow
+        external_json_writer_ = std::make_shared<ExternalJsonWriterCB>();
+        ICHECK(external_json_writer_);
+      }
+      static ExternalJsonWriterCB* GetExternalJsonWriter() { return external_json_writer_.get(); }
+      ....
+      LoweredOutput Codegen(IRModule mod, relay::Function func, String mod_name) {
+        ....
+
+        // if it has been registered for this GraphExecutorCodegen object, call the external JSON writer
+        if (external_json_writer_->HasCallback()) {
+          std::ostringstream external_os;
+          dmlc::JSONWriter external_writer(&external_os);
+          external_json_writer_->Exe(&external_writer, ret.external_mods, nodes_, heads_);
+          ret.external_graph_json = external_os.str();
+        }
+
+        return ret;
+      }
+    };
+
+    extern "C" ExternalJsonWriterCB* GetExternalJsonWriter() {
+      return GraphExecutorCodegen::GetExternalJsonWriter();
+    }
+
+    /////////////
+    // in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+    // Marvell-specific extensions
+    class GraphInputNodeMrvlExt : public GraphInputNode {
+        ...
+        GraphNodeType Type() const override { return kGraphInputNodeExt; }
+        void Save(dmlc::JSONWriter* writer) const override { /* extensions */ }
+    }
+
+    class GraphOpNodeMrvlExt : public GraphOpNode {
+        ...
+        GraphNodeType Type() const override { return kGraphOpNodeExt; }
+        void Load(dmlc::JSONReader* reader) override;
+        void LoadAttrs(dmlc::JSONReader* reader);
+        std::pair<std::string, GraphAttrs> GetLoadedGraphAttrs();
+    }
+
+    class MrvlExtJson {
+     public:
+      MrvlExtJson() {
+        ICHECK(!GetExternalJsonWriter()->HasCallback()) << "ERROR: has registered callback";
+        GetExternalJsonWriter()->RegisterCB(this, &MrvlExtJson::GetExternalJSON);
+      }
+      virtual ~MrvlExtJson() {}
+      void GetExternalJSON(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                           std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads);
+      void LoadExternalJsonAttrs(std::unordered_map<std::string, GraphAttrs>* external_attrs_map,
+                                 const Array<tvm::runtime::Module>& external_mods);
+    };
+```
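+
+* For completeness, the second RegisterCB() overload shown above also accepts a free function. A minimal
+  sketch of that form is given below; the function name and body are hypothetical and only mirror the
+  std::function signature stored inside ExternalJsonWriterCB:
+
+```
+    // hypothetical free-function callback matching the ExternalJsonWriterCB signature
+    void MyExternalJsonWriter(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                              std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+      // ... emit custom external JSON for the given nodes/heads here ...
+    }
+
+    // registration, e.g., during codegen setup:
+    GetExternalJsonWriter()->RegisterCB(&MyExternalJsonWriter);
+```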
+
+* The need to link between the pre-trained model and the final Marvell backend layers - for instance, through tvm_custom:
+    * We did not include prototype code in PR-9730 but intend to provide our sample changes in another RFC and PR.
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* We haven't identified any major items that should *not* be done. Several other aspects of the design are deliberate
+  choices - that is, we understand there are benefits both to doing and to not doing them.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.
+
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+* We are following the existing TVM BYOC framework and example files.
+    * For example: to do IR compositions, to define our own IR passes, and to mix implementations in Python/C++.
+
+* We have extended graph_executor_codegen.cc and the JSON loader/saver in order to read and write out Marvell-specific
+  attributes.
+
+* Currently, we haven't spent enough time to understand the tvm/rust/cargo requirements and steps. Therefore, we are
+  bypassing the tvm/Jenkinsfile's tests/scripts/task_rust.sh step. We will need help to re-enable it.
+
+* We would like to duplicate the Jenkins environment in order to run tvm/Jenkinsfile as is, but we ran into many issues.
+  Currently, we have a TVM-like Jenkins environment that only runs a subset of test suites, using a modified
+  Jenkinsfile.
+
+* We have identified a need to allow a callback function to be registered when generating the Mrvl-BYOC-specific
+  Nodes-JSON file. We are trying to follow the TVM Python/CPP callback style as much as possible. However, since our
+  callback function tvm/src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc::GetExternalJSON() is using
+  non-simple argument types, we need suggestions/guidelines from the TVM community in order to make the
+  new callback code meet TVM community requirements here.
+
+* For one Mrvl-BYOC relay transformation pass, we have identified a need to inject a (global) expr node ID into the
+  RelayExprNode class and its derived classes, Tuple and CallNode, so that during the transformation pass we can
+  uniquely identify each Tuple or CallNode object. Again, we need suggestions/guidelines from the TVM community
+  in order to know whether this is one of the best ways to achieve the Mrvl-BYOC need.

Review comment:
       Yes, and I have updated the RFC to include more information regarding exprnode_id and its usages.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] areusch commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
areusch commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r814216238



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,560 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+We follow what the TVM BYOC flow does (e.g., as done by others) to create our TVM-BYOC-Marvell POC code files and flow under the following folders -- refer to the uploaded apache/tvm-PR-9730 POC for details:
+
+```
+  - cmake/modules/contrib/Mrvl.cmake
+  - python/tvm/relay/op/contrib/mrvl.py
+  - src/relay/backend/contrib/mrvl/codegen.cc, drop_noop_transpose.cc,
+    graph_executor_codegen_mrvl.cc
+  - src/runtime/contrib/mrvl/mrvl_runtime.cc
+  - tests/python/contrib/test_mrvl/__init__.py, infrastructure.py,
+    test_mrvl_codegen.py
+  - plus, other corresponding changes
+```
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-BYOC-Marvell AOT compilation and code-gen flow as illustrated in Figure1 and
+STEPs (1), (2), (3a), (3b), and (4) below.
+
+### Figure 1: TVM-BYOC-Marvell AOT Compilation, Code-gen Flow
+![](./assets/0048/figure1-flow.png)
+
+### STEP (1) Run TVM-BYOC-Marvell AOT ML Frontend Compilation and TVM-BYOC-Marvell code-gen using typical TVM flow.
+
+The main input to STEP (1) is a pre-trained ONNX or MXNet model; and two outputs coming out of STEP (1) include a pair of Nodes-JSON file and Constants-JSON file for each Marvell sub-graph. This pair of JSON files represents the meta-data information of a Marvell sub-graph, which is a part of the given pre-trained model identified by the TVM-BYOC-Marvell flow.
+
+Utilizing the uploaded POC changes in apache/tvm-PR-9730, a sample code snippet for STEP (1) is illustrated below:
+
+```
+  import tvm
+  from tvm import relay
+  from tvm.relay.op.contrib import mrvl
+  from gluoncv import model_zoo, data, utils
+
+  ...
+
+  ssd_resnet50 = model_zoo.get_model("ssd_512_resnet50_v1_voc", pretrained=True)
+  inp_shape = (1, 3, 512, 512)
+  raw_model_ir, weight_bias_params = relay.frontend.from_mxnet(ssd_resnet50, {"data": inp_shape})
+
+  # call mrvl.partition_for_mrvl()
+  (model_mrvl, model_other, orig_params, opt_level, disabled_pass, orig_mod,
+      mrvl_layers_in_mrvl_subgraph) = mrvl.partition_for_mrvl(
+      raw_model_ir, params=weight_bias_params, tvm_custom_dict={},
+      gen_non_mrvl_subgraph=False, flow_pass=1)
+
+  # call relay.build() and mrvl.dump_json_meta_data_files()
+  build_target, device_id = "llvm", 0
+  mod_name = relay.backend.utils.mangle_module_name("")
+  byoc_executor = relay.build(model_mrvl, target=build_target, mod_name=mod_name)
+  byoc_const_params = byoc_executor.get_params()
+  byoc_external_graph_json = byoc_executor.get_external_graph_json()
+  nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+      byoc_external_graph_json, byoc_const_params,
+      filename_prefix=f"{model_name}-tvm-mrvl-byoc-ir")
+...
+```
+
+First, we can download a pre-trained SSD-ResNet50 model from the MXNet-gluoncv site; then, call the mrvl.partition\_for\_mrvl() function to trigger the TVM-BYOC-Marvell flow; and finally, call relay.build() function and mrvl.dump\_json\_meta\_data\_files() function to generate a pair of JSON files for each Marvell sub-graph identified by the TVM-BYOC-Marvell flow.
+
+We are calling the byoc\_executor.get\_external\_graph\_json() function and the byoc\_executor.get\_params() function in order to generate both Nodes-JSON file and Constants-JSON file, respectively.
+
+* The get\_external\_graph\_json() function is a new addition to Python class BuildModule(object).
+* The get\_params() function exists for Python class BuildModule(object), but to make it work, we need to disable the "removal external params" CPP code block in the CPP class RelayBuildModule.
+
+Sub steps involved in STEP (1) are (refer to Figures 1, 2a, 2b, 3 with descriptions below):
+
+* Load pre-trained network into TVM IR graph.
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator.
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability in the accelerator.
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph.
+* Partition IR graph into one or more for-accelerator Marvell sub-graphs and/or one or more LLVM-non-Marvell sub-graphs (e.g., for running inference on ARMv9):
+
+    * These sub-graphs cover the whole pre-trained network.
+
+    * For-accelerator Marvell sub-graph here means & contains a set of connected, composite-merged/fused Call nodes (i.e., not just one composite-merged/fused Call node function).  NOTE: the term sub-graph defined here can be different from the existing TVM sub-graph definition.
+
+    * As shown in Figure 2a, a pre-trained CNN ONNX model (on the left) is processed by the TVM-BYOC-Marvell flow into only one Marvell sub-graph (illustrated in the middle of Figure 2a) where operators of the given ONNX model are composite-merged/fused into 8 fused composition functions in the Marvell sub-graph. For example, near the bottom left, a set of MatMul + Add + Relu operators of the ONNX model is fused into one tvmgen\_mrvl\_main\_7 composition function in the Marvell sub-graph.
+
+    * As another example in Figure 2b, given the same CNN ONNX model, we can apply a different argument value but this time to ask the TVM-BYOC-Marvell flow, mrvl.partition\_for\_mrvl(...), to identify one Marvell sub-graph of 4 fused composition Call node functions and another LLVM-non-Marvell sub-graph as illustrated in the middle top sub-graph A and in the middle bottom sub-graph B, respectively.  This special argument value can lead to different inference performance in terms of meeting latency, bandwidth, and/or memory requirements.
+
+    * For the first TVM-BYOC-Marvell revision, at most one for-accelerator Marvell sub-graph and at most one LLVM-non-Marvell sub-graph can be identified; plus, the for-accelerator Marvell sub-graph can only use input tensor(s) of given pre-trained network as its sub-graph’s input tensors.
+
+    * Figure 3 illustrates what a complex Marvell sub-graph can look like. The whole sub-graph shown here represents a Marvell sub-graph of more than 100 fused composition Call node functions, and it comes from the pre-trained SSD-ResNet50 MXNet model. The LLVM-non-Marvell sub-graph part of the SSD-ResNet50 model is not displayed here, but it contains the rest of the object-detection part of the model in order to finalize 2D-BBOXes and labels.
+
+* Do code-gen step for each Marvell sub-graph by producing pair of Nodes-JSON and Constants-JSON files:
+
+    * The TVM-BYOC-Marvell flow also specifies Marvell attributes for each composite-merged/fused Call node function so that generated Nodes-JSON file(s) and Constants-JSON file(s) can represent the meta-data information of Marvell sub-graph(s) in order to do post-processing.
+
+    * RFC reviewer feedback: can we identify the Marvell sub-graph by running a TIR-only pass after scheduling (with the potential benefit to also operate on the logical TIR buffers)? Marvell developers can and will spend time understanding the TIR flow and its passes to find out.
+
+![](./assets/0048/figure2a-onnx-1-mrvl-sub-graph-backend-layers.png)
+
+![](./assets/0048/figure2b-onnx-mrvl-sub-graph-A-llvm-sub-graph-B.png)
+
+![](./assets/0048/figure3-sample-mrvl-sub-graph-for-ssd-resnet50.png)
+
+
+### STEP (2) Run Marvell-ML/AI Backend Compiler to generate model binary for each Marvell sub-graph
+
+* As shown in middle left section of Figure 1, labeled as (2), we will execute, outside of the typical TVM flow, the Marvell-ML/AI backend compiler program to post-process Nodes-JSON and Constants-JSON files of each Marvell sub-graph in order to generate final ISA instructions (in a Marvell model binary file) to run inference on Marvell accelerator.
+
+* The Marvell-ML/AI backend compiler program will be distributed as: mrvl-tvmircomp. For example, the command line below can be used to generate the model binary file for a pair of CNN JSON files to run fp16-based inference by utilizing 1M bytes of On-Chip memory on each of 4 HW compute tiles:
+
+```
+  $ mrvl-tvmircomp --model_name cnn --nodes cnn-tvm-mrvl-byoc-ir.json \
+        --consts cnn-tvm-mrvl-byoc-const.json \
+        --arch=MLIP --dram_addr_relocatable=1 --ocm_base=0x0 -ocm_size=0x100000 \
+        --num_tiles=4 --quantize=float16
+
+  note: the output model binary file generated is: cnn.bin
+
+```
+
+* The Marvell backend compiler does additional AOT optimizations, including grouping, allocating, and mapping layer-based tensors and computes onto the pre-allocated resources (such as above: 4 compute tiles and 1M bytes on each of the 4 tiles) available on the Marvell accelerator.  Sample layer-based structures used by ISA instructions for the CNN model are illustrated in the rightmost column in both Figure 2a and Figure 2b.
+
+* Note: Marvell ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will focus only on float16 AOT compilation to run float16 inference.
+
+* Note: Marvell can provide a mrvl-tvmircomp executable to TVM CI environment to run TVM Jenkins build & tests.
+
+
+### STEP (3a) or (3b) Run inference on the Software Simulator or on the Marvell ML/AI HW accelerator for the Marvell sub-graph
+
+* As illustrated in the middle left section of Figure 1, labeled as (3a), a cycle-approximate Marvell Software Simulator, mlModel, which mimics the Marvell ML/AI HW accelerator in a cycle-approximate manner, will be distributed. The Marvell Software Simulator can be used to read in a Marvell model binary file and its corresponding inference input file(s) to run inference and generate results for the Marvell sub-graph. For example, the command line below can be used to run inference:
+
+```
+  $ mlModel --model_binary cnn.bin --inputs cnn_input/input1.bin --arch=MLIP --perf_debug
+
+  note1: the inference output will be saved at: cnn-output.bin
+  note2: optionally, cycle-level information for performance debugging can also be dumped
+
+```
+
+* Note: Marvell can provide a mlModel executable to TVM CI environment to run TVM Jenkins build & tests.
+
+* Also as illustrated on the right side of Figure 1, labeled as (3b), tools, driver, and firmware are available so that they can be used to run inference on a Marvell ML/AI inference HW accelerator.
+
+
+### STEP (4) Use TVM-LLVM Compiler & Runtime to run inference for the LLVM-non-Marvell sub-graph
+
+* As illustrated in the bottom left section of Figure 1, labeled as (4), an integration step between sub-graph(s) needs to be done at inference runtime in order to run full inference for the given pre-trained model. We can use the TVM-LLVM flow to generate a runtime .so binary for each LLVM-non-Marvell sub-graph.  POC code for STEP (4) is not yet ready (WIP) and is not included in the uploaded apache/tvm-PR-9730.
+
+* For the first BYOC-Marvell revision, at most one integration step from a for-accelerator Marvell sub-graph to a LLVM-non-Marvell sub-graph is implemented.
+
+### Exercise TVM-BYOC-Marvell flow
+
+To exercise the TVM-BYOC-Marvell flow, we have provided a tests/python/contrib/test\_mrvl folder with test\_mrvl\_codegen.py and infrastructure.py files that show how to exercise the TVM-BYOC-Marvell flow for a pre-trained SSD-ResNet50 model.  In addition, Marvell is also planning to provide the Marvell backend compiler (mrvl-tvmircomp) and the Marvell HW accelerator software simulator (mlModel) so that they can be used to read in JSON files generated by the TVM-BYOC-Marvell flow and run inference to get results.
+
+In the uploaded apache/tvm-PR-9730 branch,

Review comment:
       could you finish this sentence or rm?

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,560 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+We follow what the TVM BYOC flow does (e.g., as done by others) to create our TVM-BYOC-Marvell POC code files and flow under the following folders -- refer to the uploaded apache/tvm-PR-9730 POC for details:
+
+```
+  - cmake/modules/contrib/Mrvl.cmake
+  - python/tvm/relay/op/contrib/mrvl.py
+  - src/relay/backend/contrib/mrvl/codegen.cc, drop_noop_transpose.cc,
+    graph_executor_codegen_mrvl.cc
+  - src/runtime/contrib/mrvl/mrvl_runtime.cc
+  - tests/python/contrib/test_mrvl/__init__.py, infrastructure.py,
+    test_mrvl_codegen.py
+  - plus, other corresponding changes
+```
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-BYOC-Marvell AOT compilation and code-gen flow as illustrated in Figure1 and
+STEPs (1), (2), (3a), (3b), and (4) below.
+
+### Figure 1: TVM-BYOC-Marvell AOT Compilation, Code-gen Flow
+![](./assets/0048/figure1-flow.png)
+
+### STEP (1) Run TVM-BYOC-Marvell AOT ML Frontend Compilation and TVM-BYOC-Marvell code-gen using typical TVM flow.
+
+The main input to STEP (1) is a pre-trained ONNX or MXNet model; and two outputs coming out of STEP (1) include a pair of Nodes-JSON file and Constants-JSON file for each Marvell sub-graph. This pair of JSON files represents the meta-data information of a Marvell sub-graph, which is a part of the given pre-trained model identified by the TVM-BYOC-Marvell flow.
+
+Utilizing the uploaded POC changes in apache/tvm-PR-9730, a sample code snippet for STEP (1) is illustrated below:
+
+```
+  import tvm
+  from tvm import relay
+  from tvm.relay.op.contrib import mrvl
+  from gluoncv import model_zoo, data, utils
+
+  ...
+
+  ssd_resnet50 = model_zoo.get_model("ssd_512_resnet50_v1_voc", pretrained=True)
+  inp_shape = (1, 3, 512, 512)
+  raw_model_ir, weight_bias_params = relay.frontend.from_mxnet(ssd_resnet50, {"data": inp_shape})
+
+  # call mrvl.partition_for_mrvl()
+  (model_mrvl, model_other, orig_params, opt_level, disabled_pass, orig_mod,
+      mrvl_layers_in_mrvl_subgraph) = mrvl.partition_for_mrvl(
+      raw_model_ir, params=weight_bias_params, tvm_custom_dict={},
+      gen_non_mrvl_subgraph=False, flow_pass=1)
+
+  # call relay.build() and mrvl.dump_json_meta_data_files()
+  build_target, device_id = "llvm", 0
+  mod_name = relay.backend.utils.mangle_module_name("")
+  byoc_executor = relay.build(model_mrvl, target=build_target, mod_name=mod_name)
+  byoc_const_params = byoc_executor.get_params()
+  byoc_external_graph_json = byoc_executor.get_external_graph_json()
+  nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+      byoc_external_graph_json, byoc_const_params,
+      filename_prefix=f"{model_name}-tvm-mrvl-byoc-ir")
+...
+```
+
+First, we can download a pre-trained SSD-ResNet50 model from the MXNet-gluoncv site; then, call the mrvl.partition\_for\_mrvl() function to trigger the TVM-BYOC-Marvell flow; and finally, call relay.build() function and mrvl.dump\_json\_meta\_data\_files() function to generate a pair of JSON files for each Marvell sub-graph identified by the TVM-BYOC-Marvell flow.

Review comment:
       suggest to use numbered list:
   ```suggestion
   The above code snippet does the following:
   1. Download a pre-trained SSD-ResNet50 model from the MXNet-gluoncv site
   2. Call the `mrvl.partition_for_mrvl()` function to partition the graph into Marvell and non-Marvell pieces and trigger the TVM-BYOC-Marvell flow
   3. Call relay.build() function and mrvl.dump\_json\_meta\_data\_files() function to generate a pair of JSON files for each Marvell sub-graph identified by the TVM-BYOC-Marvell flow.
   ```

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
+  in model binary file
+
+* Note: Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included
+  to upstream
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the Nodes-JSON file
+  (nodes_json_filename) and Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod,
+        params,
+    )
+
+
+    # from infrastructure.py: pseudo code of the aot_build_and_json_codegen() function shown above
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such strategy (i.e., the default strategy), under which this
+  specific sample MNIST model is turned entirely into one Mrvl subgraph with no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get to the following one Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph, as shown below
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet-optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer: tvmgen_mrvl_main_1, conceptually speaking.
+```
+      # from original IR graphs

Review comment:
       nevermind, i see you are indeed reusing the device partition flow

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,560 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+We follow what the TVM BYOC flow does (e.g., as done by others) to create our TVM-BYOC-Marvell POC code files and flow under the following folders -- refer to the uploaded apache/tvm-PR-9730 POC for details:
+
+```
+  - cmake/modules/contrib/Mrvl.cmake
+  - python/tvm/relay/op/contrib/mrvl.py
+  - src/relay/backend/contrib/mrvl/codegen.cc, drop_noop_transpose.cc,
+    graph_executor_codegen_mrvl.cc
+  - src/runtime/contrib/mrvl/mrvl_runtime.cc
+  - tests/python/contrib/test_mrvl/__init__.py, infrastructure.py,
+    test_mrvl_codegen.py
+  - plus, other corresponding changes
+```
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-BYOC-Marvell AOT compilation and code-gen flow as illustrated in Figure1 and
+STEPs (1), (2), (3a), (3b), and (4) below.
+
+### Figure 1: TVM-BYOC-Marvell AOT Compilation, Code-gen Flow
+![](./assets/0048/figure1-flow.png)
+
+### STEP (1) Run TVM-BYOC-Marvell AOT ML Frontend Compilation and TVM-BYOC-Marvell code-gen using typical TVM flow.
+
+The main input to STEP (1) is a pre-trained ONNX or MXNet model; and two outputs coming out of STEP (1) include a pair of Nodes-JSON file and Constants-JSON file for each Marvell sub-graph. This pair of JSON files represents the meta-data information of a Marvell sub-graph, which is a part of the given pre-trained model identified by the TVM-BYOC-Marvell flow.
+
+Utilizing the uploaded POC changes in apache/tvm-PR-9730, a sample code snippet for STEP (1) is illustrated below:
+
+```
+  import tvm
+  from tvm import relay
+  from tvm.relay.op.contrib import mrvl
+  from gluoncv import model_zoo, data, utils
+
+  ...
+
+  ssd_resnet50 = model_zoo.get_model("ssd_512_resnet50_v1_voc", pretrained=True)
+  inp_shape = (1, 3, 512, 512)
+  raw_model_ir, weight_bias_params = relay.frontend.from_mxnet(ssd_resnet50, {"data": inp_shape})
+
+  # call mrvl.partition_for_mrvl()
+  (model_mrvl, model_other, orig_params, opt_level, disabled_pass, orig_mod,
+      mrvl_layers_in_mrvl_subgraph) = mrvl.partition_for_mrvl(
+      raw_model_ir, params=weight_bias_params, tvm_custom_dict={},
+      gen_non_mrvl_subgraph=False, flow_pass=1)
+
+  # call relay.build() and mrvl.dump_json_meta_data_files()
+  build_target, device_id = "llvm", 0
+  mod_name = relay.backend.utils.mangle_module_name("")
+  byoc_executor = relay.build(model_mrvl, target=build_target, mod_name=mod_name)
+  byoc_const_params = byoc_executor.get_params()
+  byoc_external_graph_json = byoc_executor.get_external_graph_json()
+  nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+      byoc_external_graph_json, byoc_const_params,
+      filename_prefix=f"{model_name}-tvm-mrvl-byoc-ir")
+...
+```
+
+First, we download a pre-trained SSD-ResNet50 model from the MXNet-gluoncv model zoo; then we call the mrvl.partition\_for\_mrvl() function to trigger the TVM-BYOC-Marvell flow; and finally we call the relay.build() function and the mrvl.dump\_json\_meta\_data\_files() function to generate a pair of JSON files for each Marvell sub-graph identified by the TVM-BYOC-Marvell flow.
+
+We call the byoc\_executor.get\_external\_graph\_json() function and the byoc\_executor.get\_params() function to generate the Nodes-JSON content and the Constants-JSON content, respectively.
+
+* The get\_external\_graph\_json() function is a new addition to Python class BuildModule(object).
+* The get\_params() function already exists in the Python class BuildModule(object), but to make it work here, we need to disable the "removal external params" CPP code block in the CPP class RelayBuildModule.
+
+Sub steps involved in STEP (1) are (refer to Figures 1, 2a, 2b, 3 with descriptions below):
+
+* Load pre-trained network into TVM IR graph.
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator.
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability in the accelerator.
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph.
+* Partition IR graph into one or more for-accelerator Marvell sub-graphs and/or one or more LLVM-non-Marvell sub-graphs (e.g., for running inference on ARMv9):
+
+    * These sub-graphs cover the whole pre-trained network.
+
+    * A for-accelerator Marvell sub-graph here means and contains a set of connected, composite-merged/fused Call nodes (i.e., not just one composite-merged/fused Call node function).  NOTE: the term sub-graph as defined here can differ from the existing TVM sub-graph definition.
+
+    * As shown in Figure 2a, a pre-trained CNN ONNX model (on the left) is processed by the TVM-BYOC-Marvell flow into a single Marvell sub-graph (illustrated in the middle of Figure 2a) where operators of the given ONNX model are composite-merged/fused into 8 fused composition functions in the Marvell sub-graph. For example, near the bottom left, a set of MatMul + Add + Relu operators of the ONNX model is fused into one tvmgen\_mrvl\_main\_7 composition function in the Marvell sub-graph.
+
+    * As another example, in Figure 2b, given the same CNN ONNX model, we can pass a different argument value to ask the TVM-BYOC-Marvell flow, mrvl.partition\_for\_mrvl(...), to identify one Marvell sub-graph of 4 fused composition Call node functions plus one LLVM-non-Marvell sub-graph, shown as sub-graph A (middle top) and sub-graph B (middle bottom), respectively.  This argument choice can lead to different inference performance in terms of latency, bandwidth, and/or memory requirements (see the snippet after this list).
+
+    * For the first TVM-BYOC-Marvell revision, at most one for-accelerator Marvell sub-graph and at most one LLVM-non-Marvell sub-graph can be identified; plus, the for-accelerator Marvell sub-graph can only use input tensor(s) of given pre-trained network as its sub-graph’s input tensors.
+
+    * Figure 3 illustrates what a complex Marvell sub-graph can look like. The whole sub-graph shown here is a Marvell sub-graph of more than 100 fused composition Call node functions and comes from the pre-trained SSD-ResNet50 MXNet model. The LLVM-non-Marvell sub-graph of the SSD-ResNet50 model is not displayed here, but it contains the rest of the object-detection part of the model, which finalizes 2D-BBOXes and labels.
+
+* Do code-gen step for each Marvell sub-graph by producing pair of Nodes-JSON and Constants-JSON files:
+
+    * The TVM-BYOC-Marvell flow also specifies Marvell attributes for each composite-merged/fused Call node function so that the generated Nodes-JSON file(s) and Constants-JSON file(s) can represent the meta-data information of the Marvell sub-graph(s) for post-processing.
+
+    * RFC reviewer feedback: can we identify the Marvell sub-graph by running a TIR-only pass after scheduling (with the potential benefit of also operating on the logical TIR buffers)? Marvell developers can and will spend time understanding the TIR flow and its passes to find out.
+
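+As noted above, the exact argument that produces the two-sub-graph split of Figure 2b is a detail of the POC; presumably it is the gen\_non\_mrvl\_subgraph flag from the STEP (1) snippet, in which case the call would look roughly like the sketch below (assumption, not confirmed against apache/tvm-PR-9730):
+
+```
+  # ask the flow to also emit the LLVM-non-Marvell sub-graph B (flag value assumed)
+  (model_mrvl, model_other, orig_params, opt_level, disabled_pass, orig_mod,
+      mrvl_layers_in_mrvl_subgraph) = mrvl.partition_for_mrvl(
+      raw_model_ir, params=weight_bias_params, tvm_custom_dict={},
+      gen_non_mrvl_subgraph=True, flow_pass=1)
+```
+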
+![](./assets/0048/figure2a-onnx-1-mrvl-sub-graph-backend-layers.png)
+
+![](./assets/0048/figure2b-onnx-mrvl-sub-graph-A-llvm-sub-graph-B.png)
+
+![](./assets/0048/figure3-sample-mrvl-sub-graph-for-ssd-resnet50.png)
+
+
+### STEP (2) Run Marvell-ML/AI Backend Compiler to generate model binary for each Marvell sub-graph
+
+* As shown in middle left section of Figure 1, labeled as (2), we will execute, outside of the typical TVM flow, the Marvell-ML/AI backend compiler program to post-process Nodes-JSON and Constants-JSON files of each Marvell sub-graph in order to generate final ISA instructions (in a Marvell model binary file) to run inference on Marvell accelerator.
+
+* The Marvell-ML/AI backend compiler program will be distributed as: mrvl-tvmircomp. For example, the command line below can be used to generate the model binary file for a pair of CNN JSON files to run fp16-based inference by utilizing 1M bytes of On-Chip memory on each of 4 HW compute tiles:
+
+```
+  $ mrvl-tvmircomp --model_name cnn --nodes cnn-tvm-mrvl-byoc-ir.json \
+        --consts cnn-tvm-mrvl-byoc-const.json \
+        --arch=MLIP --dram_addr_relocatable=1 --ocm_base=0x0 --ocm_size=0x100000 \
+        --num_tiles=4 --quantize=float16
+
+  note: the output model binary file generated is: cnn.bin
+
+```
+
+* The Marvell backend compiler does additional AOT optimizations, including grouping, allocating, and mapping layer-based tensors and computes onto pre-allocated resources (such as the 4 compute tiles and 1M bytes per tile above) available on the Marvell accelerator.  Sample layer-based structures used by ISA instructions for the CNN model are illustrated in the right-most column of both Figure 2a and Figure 2b.
+
+* Note: Marvell ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will focus only on float16 AOT compilation to run float16 inference.
+
+* Note: Marvell can provide a mrvl-tvmircomp executable to TVM CI environment to run TVM Jenkins build & tests.
+
+
+### STEP (3a) or (3b) Run inference on the Software Simulator or on the Marvell ML/AI HW accelerator for the Marvell sub-graph
+
+* As illustrated in the middle left section of Figure 1, labeled as (3a), a cycle-approximate Marvell Software Simulator, mlModel, which approximates the cycle behavior of the Marvell ML/AI HW accelerator, will be distributed. The Marvell Software Simulator can read in a Marvell model binary file and its corresponding inference input file(s) to run inference and generate results for the Marvell sub-graph. For example, the command line below can be used to run inference:
+
+```
+  $ mlModel --model_binary cnn.bin --inputs cnn_input/input1.bin --arch=MLIP --perf_debug
+
+  note1: the inference output will be saved at: cnn-output.bin
+  note2: optionally, cycle-level information for performance debug can also be dumped
+
+```
+
+* Note: Marvell can provide a mlModel executable to TVM CI environment to run TVM Jenkins build & tests.
+
+* Also, as illustrated on the right side of Figure 1, labeled as (3b), tools, a driver, and firmware are available so that they can be used to run inference on a Marvell ML/AI inference HW accelerator.
+
+
+### STEP (4) Use TVM-LLVM Compiler & Runtime to run inference for the LLVM-non-Marvell sub-graph
+
+* As illustrated in the bottom left section of Figure 1, labeled as (4), an integration step between sub-graphs needs to be done at inference runtime in order to run full inference for the given pre-trained model. We can use the TVM-LLVM flow to generate a runtime .so binary for each LLVM-non-Marvell sub-graph (see the sketch after this list).  POC code for STEP (4) is not yet ready (WIP) and is not included in the uploaded apache/tvm-PR-9730.
+
+* For the first BYOC-Marvell revision, at most one integration step from a for-accelerator Marvell sub-graph to a LLVM-non-Marvell sub-graph is implemented.
+
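+A minimal sketch of what the LLVM-non-Marvell part could look like with the regular TVM flow is shown below. It is illustrative only: the actual STEP (4) integration code is not part of PR-9730, and model\_other here refers to the LLVM-non-Marvell sub-graph returned by mrvl.partition\_for\_mrvl() in STEP (1).
+
+```
+  # build the LLVM-non-Marvell sub-graph with the standard TVM LLVM target
+  llvm_executor = relay.build(model_other, target="llvm")
+  llvm_executor.export_library("non_mrvl_subgraph.so")
+```
+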
+### Exercise TVM-BYOC-Marvell flow
+
+To exercise the TVM-BYOC-Marvell flow, we have provided a tests/python/contrib/test\_mrvl folder with test\_mrvl\_codegen.py and infrastructure.py files that show how to exercise the TVM-BYOC-Marvell flow for a pre-trained SSD-ResNet50 model.  In addition, Marvell is also planning to provide the Marvell backend compiler (mrvl-tvmircomp) and the Marvell HW accelerator software simulator (mlModel) so that they can be used to read in JSON files generated by the TVM-BYOC-Marvell flow and run inference to get results.
+
+These test files are included in the uploaded apache/tvm-PR-9730 branch; one typical way to run them is shown below.
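+
+For example, with a TVM build that has the Marvell POC changes enabled, the codegen test can be run with pytest in the usual TVM way (command for illustration only):
+
+```
+  $ python3 -m pytest -v tests/python/contrib/test_mrvl/test_mrvl_codegen.py
+```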
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+### Illustration using a MNIST model
+
+Let's use the Keras MNIST fashion model below as an example (partial & pseudo code for illustration). Please also refer to the files of the uploaded apache/tvm-PR-9730 for details.
+
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the train\_images dataset below and save the pre-trained model in ONNX (say, mnist\_fashion.onnx). Then, we can run the BYOC-Marvell flow by giving any image of the orig\_test\_images[i] dataset to get its inference fashion label and item name in top\_label\_id and fashion\_label\_dictionary[top\_label\_id], respectively. In addition, we can also use the corresponding golden label, golden\_output\_labels[i], to validate the inference result.
+
+```
+  (train_images, train_labels), (
+      orig_test_images,
+      golden_output_labels,
+  ) = keras.datasets.fashion_mnist.load_data()
+```
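+
+For completeness, a minimal train-and-export sketch is shown below. It is illustrative only: the sketch trains in NHWC layout with a Flatten layer (the pseudo code above lists an NCHW input shape and a Reshape layer), and tf2onnx is assumed as the ONNX exporter, which may differ from what the POC used.
+
+```
+  import numpy as np
+  import tensorflow as tf
+  from tensorflow import keras
+  import tf2onnx  # assumption: tf2onnx is used for the ONNX export
+
+  (train_images, train_labels), (orig_test_images, golden_output_labels) = \
+      keras.datasets.fashion_mnist.load_data()
+  # scale to [0, 1] and add a trailing channel axis (NHWC)
+  train_images = train_images.astype("float32")[..., np.newaxis] / 255.0
+
+  model = keras.Sequential([
+      keras.Input(shape=(28, 28, 1)),
+      keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu"),
+      keras.layers.MaxPooling2D(pool_size=(2, 2)),
+      keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu"),
+      keras.layers.MaxPooling2D(pool_size=(2, 2)),
+      keras.layers.Dropout(0.3),
+      keras.layers.Flatten(),
+      keras.layers.Dense(256, activation="relu"),
+      keras.layers.Dense(10),
+  ])
+  model.compile(optimizer="adam",
+                loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+                metrics=["accuracy"])
+  model.fit(train_images, train_labels, epochs=2, batch_size=64)
+
+  # save the pre-trained model as mnist_fashion.onnx (file name from the text above)
+  spec = (tf.TensorSpec((1, 28, 28, 1), tf.float32, name="input"),)
+  tf2onnx.convert.from_keras(model, input_signature=spec, output_path="mnist_fashion.onnx")
+```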
+
+In the code snippet below, we call onnx.load() and relay.frontend.from\_onnx() to generate TVM mod and params. Then, they are used by the mrvl.partition\_for\_mrvl() function and the mrvl.dump\_json\_meta\_data\_files() function provided for the TVM-BYOC-Marvell flow to generate Nodes-JSON file (nodes\_json\_filename) and Constants-JSON file (consts\_json\_filename).

Review comment:
       in the PoC PR, `partition_for_mrvl` is registered in python/tvm/driver/tvmc/composite_target.py along with the other BYOC partitioners, but its signature differs significantly (from the de-facto `partition_func(IRModule) -> IRModule`):
   ```
       """Partition the graph greedily offloading supported
       operators to Mrvl
   
       Parameters
       ----------
       mod : Module
           The module to run passes on.
       params : Optional[Dict[str, NDArray]]
           Constant input parameters.
   
       Returns
       -------
       mod_mrvl : annotated and partitioned module - part 1, the mrvl sub graph
       mod_other : annotated and partitioned module - part 2, if any, the rest sub graph
       params : TBA
       opt_level : TBA
       disabled_pass_list : TBA
       mod : TBA
       mrvl_layers_in_mrvl_subgraph : TBA
       """
   ```
   
   what's your intention here?  in order to register this function in `REGISTERED_CODEGEN`, you'll need to make that signature match up. however, i think from my reading, what's happening here is you're invoking a fair bit of the compilation pipeline underneath a hard-coded PassContext, then returning a fair bit of extra information here. some of this information looks fairly specific to the Marvell lowering flow.
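    For illustration only (the wrapper name and argument handling below are assumptions, not taken from the PoC), a thin adapter that conforms to the de-facto signature could look like:
    ```
    def partition_for_mrvl_tvmc(mod, params=None, **opts):
        """Sketch: adapt the Marvell partitioner to partition_func(IRModule) -> IRModule."""
        mod_mrvl, _mod_other, *_rest = partition_for_mrvl(
            mod, params=params, tvm_custom_dict={}, gen_non_mrvl_subgraph=False, flow_pass=1)
        return mod_mrvl
    ```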







[GitHub] [tvm-rfcs] mbs-octoml commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
mbs-octoml commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787217701



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability

Review comment:
       Hi, thanks for the RFC. My team at OctoML is looking at bringing some training features to the BYOC world (a la https://arxiv.org/pdf/2111.00655.pdf), so I'm looking at this RFC with that future in mind. Can you expand on:
    - Is the fusion using the existing MergeComposite / AnnotateTarget/ MergeCompilerRegions(maybe) / PartitionGraph sequence?
    - Other than the global layout xform, which necessarily must be done before any fusion etc, are there any other xforms before the above partitioning takes place?
    - Can you explain the need to limit to one kernel for each of your byoc and the default tvm? Perhaps it's an artifact of how you're later trying to capture the byoc output in json graph form? Ideally the BYOC target.ext.name function could be run multiple times, the resulting runtime::Module would be accumulated in the IRModule, and the runtime::Modules later merged. Perhaps supporting that would actually be easier and would remove the at-most-one kernel limit?
    - Ideally there'd be a single entry point for 'partition for marvel', after which the regular TVM build would deal with fusion, lowering and codegen for everything that's left (ie overall model - kernels you already partitioned out). I may not be following the explanation but it seems you're proposing the driver splits things more explicitly.
    - Like @areusch  I'm a bit confused by the special handling of the graph. Perhaps it would be worth going through the tensorrt BYOC integration as a reference example since it too collects a JSON representation of the to-be-complied fused sub-graph (we invoke the TensorRT build function at runtime not compile time), but it does so on top of existing machinery. 
   
   Let me know if it would be easier to discuss this on a PR rather than here, then we could come back to here.   
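    (For reference, the "existing sequence" mentioned in the first bullet is the standard Relay BYOC partitioning pipeline, roughly the sketch below; the pattern table name comes from the PoC's mrvl.py.)
    ```
    seq = tvm.transform.Sequential([
        relay.transform.MergeComposite(mrvl.mrvl_pattern_table()),
        relay.transform.AnnotateTarget("mrvl"),
        relay.transform.MergeCompilerRegions(),
        relay.transform.PartitionGraph(),
    ])
    partitioned_mod = seq(mod)
    ```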
   







[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787262754



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability

Review comment:
       Let me raise a difference here:
   
     *   The TVM partition’s sub-graph seems to represent a relay function, which can include multiple frontend operators captured by utilizing the relay merge-composite pattern
  *   The Marvell sub-graph is a connected graph of multiple relay merge-composite functions. I did not know how to include a figure in the RFC file before (now I do), but the pre-RFC on the discuss forum does include figures at its end; please check the end of the pre-RFC and its figures to see whether they help explain the definition of Marvell sub-graphs here: https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691
    
   We have also upstreamed the TVM GitHub's PR-9730 as a POC (it can be downloaded via git clone https://github.com/ccjoechou/tvm.git; the changes are on the byoc-mrvl branch). Please see the tvm/python/tvm/relay/op/contrib/mrvl.py file's partition_for_mrvl() function and its seq setup there.
   There is also the test_mrvl suite, which can be run to generate JSON files for the ssd-resnet50 network.
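   For example (repository URL and branch name as given above):
   ```
   git clone https://github.com/ccjoechou/tvm.git
   cd tvm
   git checkout byoc-mrvl
   ```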
   
   [Using our definition of sub-graph -- not the TVM partition's definition of sub-graph] Yes, the limitation of at most one Mrvl sub-graph and at most one LLVM sub-graph can be relaxed later on, once we have the runtime & driver hookups ready and the driver & firmware of our HW accelerator can handle multiple sub-graphs. We will be spending time on this area in the next few months.







[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787267709



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability

Review comment:
       @mbs-octoml: Thanks for replying. Please also see my comments on @areusch's reply below, including several in-line write-ups, since they may provide information relevant to your questions too. Please let me know if anything can be clarified on the TVM GitHub PR-9730 front.
    Currently, we are also running parts of the tvm/Jenkinsfile stages and their steps locally using our own Jenkins server. However, we are having problems debugging a rust/cargo issue (the tvm/scripts/task_rust.sh suite). It would be great if you could provide additional information on how to build our "local" tvm-build package (I can git clone the current OctoML GitHub tvm-build repo) and then how to adjust the tvm/rust/Cargo.toml file to use our "local" tvm-build package.
    Also, any tips and pointers on how to debug the rust/cargo build would be appreciated.
    Thanks.
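    (For reference, and as an assumption rather than a verified recipe for this repo: if tvm-build is pulled from crates.io, the usual Cargo way to substitute a local checkout is a [patch] override in the top-level Cargo.toml; a git-sourced dependency can be patched the same way by naming its URL instead of crates-io.)
    ```
    [patch.crates-io]
    tvm-build = { path = "/path/to/local/tvm-build" }
    ```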







[GitHub] [tvm-rfcs] ssdurako commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ssdurako commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1015682283


   @areusch @jroesch 
   Is there some aging process to follow in the TVM Open Source so issues get looked at and advanced without getting stale for weeks? As there are timelines that get affected by such delays. 
   
   Thanks





[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r797198719



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
+  in model binary file
+
+* Note: Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included
+  to upstream
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in the pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the Nodes-JSON file
+  (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod,
+        params,
+    )
+
+
+    # from infrastructure.py: pseudo code defined by the above aot_build_and_json_codegen() function
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such alternative (i.e., the default strategy) so
+  that, for this specific sample MNIST model, the entire network model is turned into one Mrvl subgraph and
+  no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get to the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for the given IR graph:
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are, conceptually speaking, composited into one Marvell layer: tvmgen_mrvl_main_3.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
+
+* Because the Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first tvmgen_mrvl_main_0 layer
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts the input to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we utilize the unique en_id attribute
+  stored for the class CallNode and the class Tuple (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can contain composited Marvell layer(s), back to its original IR nodes (e.g., to use the original tensor
+      layout and no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph that is recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using the original layouts with no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl.functions, mod_mrvl.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(mrvl_layers_in_mrvl_subgraph, mod_mrvl).visit(mod_mrvl["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}))
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:
+    * Some common classes have been moved from the original src/relay/backend/graph_executor_codegen.cc file to the
+      new src/relay/backend/graph_executor_codegen.h file so that they can be shared by Marvell-specific functions
+      and derived classes defined in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+
+    * new definitions are listed below:
+```
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.h file
+    /*! \brief Node types */
+    enum GraphNodeType {
+      kGraphNop,
+      kGraphInputNode,
+      kGraphOpNode,
+      kGraphInputNodeExt,
+      kGraphOpNodeExt,
+    };
+
+    
+    class ExternalJsonWriterCB {
+     public:
+      template <class T>
+      void RegisterCB(T* const object,
+                      void (T::*const mf)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                          std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        using namespace std::placeholders;
+        callback_ = std::bind(mf, object, _1, _2, _3, _4);
+        hasCallback_ = true;
+      }
+      void RegisterCB(void (*const fun)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                        std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        callback_ = fun;
+        hasCallback_ = true;
+      }
+      void Exe(dmlc::JSONWriter* external_writer, Array<tvm::runtime::Module> mod,
+               std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+        ICHECK(hasCallback_) << "ERROR: no registered callback";
+        callback_(external_writer, mod, nodes, heads);
+      }
+      inline bool HasCallback() { return hasCallback_; }
+
+     private:
+      std::function<void(dmlc::JSONWriter*, Array<tvm::runtime::Module>, std::vector<GraphObjectPtr>,
+                         std::vector<GraphNodeRef>)>
+          callback_;
+      bool hasCallback_{false};
+    };
+
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.cc file
+    class GraphExecutorCodegen : public backend::MemoizedExprTranslator<std::vector<GraphNodeRef>> {
+     public:
+      GraphExecutorCodegen(runtime::Module* mod, const TargetMap& targets)
+          : mod_(mod), targets_(targets) {
+        // we need the following variable to be a static member of the class so we can access
+        //   its setting in the following static GetExternalJsonWriter() function; but this static
+        //   member can actually be used as a local Callback setting for "per" GraphExecutorCodegen
+        //   instantiation during each TVM build-codegen flow
+        external_json_writer_ = std::make_shared<ExternalJsonWriterCB>();
+        ICHECK(external_json_writer_);
+      }
+      static ExternalJsonWriterCB* GetExternalJsonWriter() { return external_json_writer_.get(); }
+      ....
+      LoweredOutput Codegen(IRModule mod, relay::Function func, String mod_name) {
+        ....
+
+        // if it has been registered for this GraphExecutorCodegen object, call the external JSON writer
+        if (external_json_writer_->HasCallback()) {
+          std::ostringstream external_os;
+          dmlc::JSONWriter external_writer(&external_os);
+          external_json_writer_->Exe(&external_writer, ret.external_mods, nodes_, heads_);
+          ret.external_graph_json = external_os.str();
+        }
+
+        return ret;
+      }
+    };
+
+    extern "C" ExternalJsonWriterCB* GetExternalJsonWriter() {
+      return GraphExecutorCodegen::GetExternalJsonWriter();
+    }
+
+    /////////////
+    // in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+    // Marvell-specific extensions
+    class GraphInputNodeMrvlExt : public GraphInputNode {
+        ...
+        GraphNodeType Type() const override { return kGraphInputNodeExt; }
+        void Save(dmlc::JSONWriter* writer) const override { /* extensions */ }
+    }
+
+    class GraphOpNodeMrvlExt : public GraphOpNode {
+        ...
+        GraphNodeType Type() const override { return kGraphOpNodeExt; }
+        void Load(dmlc::JSONReader* reader) override;
+        void LoadAttrs(dmlc::JSONReader* reader);
+        std::pair<std::string, GraphAttrs> GetLoadedGraphAttrs();
+    }
+
+    class MrvlExtJson {
+     public:
+      MrvlExtJson() {
+        ICHECK(!GetExternalJsonWriter()->HasCallback()) << "ERROR: has registered callback";
+        GetExternalJsonWriter()->RegisterCB(this, &MrvlExtJson::GetExternalJSON);
+      }
+      virtual ~MrvlExtJson() {}
+      void GetExternalJSON(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                           std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads);
+      void LoadExternalJsonAttrs(std::unordered_map<std::string, GraphAttrs>* external_attrs_map,
+                                 const Array<tvm::runtime::Module>& external_mods);
+    };
+```
+
+* the need to link between pre-trained model and final Marvell backend layer - for instance, through tvm_custom
+    * We did not include prototype code in PR-9730 but intend to provide our sample changes in another RFC and PR.
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* We haven't identified any major "do not do" items. Several other designs are choices - that is, we understand that
+  there are benefits both for doing and for not doing them.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.
+
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+* We are following the existing TVM BYOC framework and example files.
+    * for example: to do IR compositions, to define our own IR passes, to mix implementations in Python/C++, etc.
+
+* We have extended graph_executor_codegen.cc and the JSON loader/saver in order to read and write out Marvell-specific
+  attributes
+
+* Currently, we haven't spent enough time to understand the tvm/rust/cargo requirements and steps. Therefore, we are
+  bypassing the tvm/Jenkinsfile's tests/scripts/task_rust.sh step. We will need help to re-enable this step.
+
+* We would like to duplicate the Jenkins environment in order to run tvm/Jenkinsfile as is, but we ran into many issues.
+  Currently, we have a tvm-like Jenkinsfile environment that only runs a subset of test suites using a modified
+  Jenkinsfile.
+
+* We have identified a need to allow a callback function to be registered when generating the Mrvl-BYOC-specific
+  Nodes-JSON file. We are trying to follow the TVM Python/CPP-CB style as much as possible. But, since our callback
+  function tvm/src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc::GetExternalJSON() is using
+  non-simple argument types, we need the TVM community to provide suggestions/guidelines so that the
+  new CB code can better meet TVM community requirements here.
+
+* For one Mrvl-BYOC relay transformation pass, we have identified a need to inject a (global) expr node ID for the
+  RelayExprNode class and its derived classes, Tuple and CallNode, so that during the transformation pass we can
+  uniquely identify each Tuple or CallNode object. Again, we need the TVM community to provide
+  suggestions/guidelines here so we can know whether this is one of the best ways to achieve the Mrvl-BYOC need.
+
+* We also identified a need to maintain linkages between (operator-)information described in the original, given
+  pre-trained network model and the code-gen JSON files so that the compiler backend will be able to report user-level
+  (e.g., meaningful-to-user) messages regarding the given pre-trained network. For instance, in the
+  tvm/python/tvm/relay/frontend/onnx.py and common.py files, we can see user-level information being captured using
+  “tvm_custom” related code as in original onnx.py file for the given pre-trained network; but, in common.py, the code
+  later drops the linkage, via attrs.pop(“tvm_custom”), and does not pass the linkage onto the initial relay IR graph.
+  We have a draft solution to maintain linkages between the given pre-trained network model and its relay IR graph
+  (using expr node ID and tvm custom ID, plus, a few utility functions), but would like to know whether the TVM
+  community has any better or work-in-progress resolution.
+
+* When using TVM RPC code to exercise and run inference on a remote-hosted Mrvl ML/AI HW accelerator for the Mrvl
+  subgraph, we ran into one minor issue and have made local TVM RPC enhancement so that, when a TVM RPC client sends

Review comment:
       BTW, in our use case, we need the server path to be known on the client side so that the client is the master that controls the activities to be run on the server side.







[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r797193790



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * A for-accelerator Mrvl subgraph here means a set of connected, composite-fused Call nodes (let's call this sub-graph A)
+      in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in the Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions
+  in a model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included
+  in the upstream contribution
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the Nodes-JSON file
+  (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        mod=mod,
+        params=params,
+        model_name="mnist_fashion",
+        working_dir="mnist",
+    )
+
+
+    # from infrastructure.py: pseudo code of the above aot_build_and_json_codegen() function
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
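+
+For readers less familiar with the BYOC plumbing: the partition_for_mrvl() entry point builds on the standard
+  Relay partitioning passes. The following is only a generic sketch of that pass sequence (not the exact Marvell
+  implementation); the layouts and the "mrvl" pattern-table name simply mirror the examples in this RFC.
+
+```
+# Generic BYOC partitioning sketch; partition_for_mrvl() layers its Marvell-specific
+# steps on top of a sequence like this.
+import tvm
+from tvm import relay
+from tvm.relay.op.contrib.register import get_pattern_table
+
+
+def partition_for_byoc(mod, params=None):
+    if params:
+        mod["main"] = relay.build_module.bind_params_by_name(mod["main"], params)
+    seq = tvm.transform.Sequential(
+        [
+            relay.transform.InferType(),
+            relay.transform.ConvertLayout({"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}),
+            relay.transform.MergeComposite(get_pattern_table("mrvl")),
+            relay.transform.AnnotateTarget("mrvl"),
+            relay.transform.MergeCompilerRegions(),
+            relay.transform.PartitionGraph(),
+        ]
+    )
+    with tvm.transform.PassContext(opt_level=3):
+        return seq(mod)
+```
+
+partition_for_mrvl() additionally applies the Marvell-specific passes and the subgraph-splitting logic described below.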
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such alternative (i.e., the default strategy) so
+  that, for this specific sample MNIST model, the entire network model is turned into one Mrvl subgraph and
+  no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We arrive at the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for the given IR graph
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer, tvmgen_mrvl_main_3, conceptually speaking.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
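+
+For reference, a composite pattern like "mrvl.conv2d_nhwc2nhwc" above is typically declared with Relay dataflow
+  patterns and registered through register_pattern_table. The sketch below is only illustrative and may differ
+  from the exact declarations in mrvl.py:
+
+```
+# Illustrative sketch of a Marvell composite-pattern declaration; the real
+# mrvl.mrvl_pattern_table() may use different names, operators, and predicates.
+from tvm.relay.dataflow_pattern import is_op, wildcard
+from tvm.relay.op.contrib.register import register_pattern_table
+
+
+def conv2d_add_relu_pattern():
+    """Match nn.conv2d -> add -> nn.relu, the sequence fused into one Marvell layer."""
+    conv = is_op("nn.conv2d")(wildcard(), wildcard())
+    bias = is_op("add")(conv, wildcard())
+    return is_op("nn.relu")(bias)
+
+
+@register_pattern_table("mrvl")
+def mrvl_pattern_table():
+    return [("mrvl.conv2d_nhwc2nhwc", conv2d_add_relu_pattern())]
+```
+
+MergeComposite then rewrites every match into a function carrying the Composite="mrvl.conv2d_nhwc2nhwc" attribute,
+  which is exactly what appears inside @tvmgen_mrvl_main_3 above.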
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for the Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer, tvmgen_mrvl_main_0,
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute
+  stored for the class CallNode and the class Tuple (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can have composited Marvell layer(s), back to its original IR nodes (e.g., to use the original tensor
+      layout and no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using the original layouts with no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """Revert the Mrvl subgraph back to an IR graph that uses the original layouts and no Marvell compositions."""
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl_subgraph.functions, mod_mrvl_subgraph.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(mrvl_layers_in_mrvl_subgraph, mod_mrvl_subgraph).visit(mod_mrvl_subgraph["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}))
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
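+
+The ExprVisitor/ExprMutator helpers named above follow the usual Relay visitor pattern. As a minimal, illustrative
+  sketch (not the exact mrvl.py code; the PoC classes additionally read the en_id attribute this RFC adds to
+  CallNode), a visitor that collects the partitioned Marvell layer calls could look like:
+
+```
+# Minimal visitor sketch; the real MrvlIRGraphUtils helpers also track en_id and
+# per-layer attributes, which are PoC-specific and not shown here.
+from tvm import relay
+from tvm.relay.expr_functor import ExprVisitor
+
+
+class CollectMrvlCalls(ExprVisitor):
+    """Record calls to partitioned tvmgen_mrvl_* functions in a module's main body."""
+
+    def __init__(self):
+        super().__init__()
+        self.mrvl_calls = []
+
+    def visit_call(self, call):
+        op = call.op
+        if isinstance(op, relay.GlobalVar) and op.name_hint.startswith("tvmgen_mrvl_"):
+            self.mrvl_calls.append(op.name_hint)
+        super().visit_call(call)
+
+
+def mrvl_layers_in(mod):
+    visitor = CollectMrvlCalls()
+    visitor.visit(mod["main"].body)
+    return visitor.mrvl_calls
+```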
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:

Review comment:
       yes







[GitHub] [tvm-rfcs] ccjoechou commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1033074928


   @areusch: You are correct that I must have done something wrong the last time, and now I have lost the linkage between my GitHub forked byoc-mrvl branch and this tvm-rfc PR-#48. Therefore, my changes from a week ago are still sitting on my personal GitHub forked byoc-mrvl branch and did not get pushed to the tvm-rfc PR-#48.
   
   Any suggestions on what I should do?
   I can see my byoc-mrvl branch is good to be merged automatically with the tvm-rfc PR-#48, but I do not see any button to click to make it happen.
   
   ![image](https://user-images.githubusercontent.com/54378300/153077895-c8ea2a7e-0d90-43fe-b634-9076cad3b8ec.png)
   





[GitHub] [tvm-rfcs] areusch commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
areusch commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1040919020


   @ccjoechou sorry for the delay--i've gotten pretty busy with something and will hopefully have some bandwidth towards the end of the week.
   
   cc @jroesch @mbs-octoml in case they have cycles





[GitHub] [tvm-rfcs] areusch commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
areusch commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1056363210


   @ccjoechou Summarizing our discussion a bit:
   
    - Marvell is interested in being able to arbitrarily partition a Relay graph into hardware-accelerated and non-hardware-accelerated parts
       - The boundaries between these parts are to be determined by Marvell backend; therefore, some additional control is needed over the default behavior provided by MergeComposite
       - @mbs-octoml suggests that they use the [StopFusion annotation](https://github.com/apache/tvm/blob/main/src/relay/op/annotation/annotation.h#L38) to manually enforce the boundaries. These annotations could be added programmatically via a Relay IRModule pass. StopFusion is used in [FuseOps pass](https://github.com/apache/tvm/blob/main/src/relay/transforms/fuse_ops.cc#L896) to avoid fusion.
        - Using this approach, the Marvell partitioning pass defined here could be simplified and the existing fusion pass could be used (a minimal sketch of inserting these annotations follows after this list).
   - Marvell needs to be able to determine which:
       - Imported ONNX operator is responsible for a given Relay node
       - Relay node is responsible for a TIR CallNode
       
       This needs to happen at two times:
       
       1. At compile time, to serve as a reference to the boundary nodes between a hardware-accelerated and non-hardware-accelerated subgraph
       2. At runtime, to determine which backend operator to call
       
       A follow-up question here from me: at runtime, couldn’t you just emit the call to the correct backend operator? I wonder if the reason this mapping was needed was due to previous difficulties configuring the TVM partitioner (it would sometimes fuse across a desired boundary). Is it possible to avoid the need for this reference at runtime given the improved partitioning approach mentioned above?
       
       That doesn't solve the problem of needing to identify a Relay node at compile time. However, if we can reduce this need to a purely compile-time need, perhaps we can develop an alternate way to refer to a Relay node given Relay source code other than adding an id to the IR. cc @tqchen @junrushao1994 in case they have ideas here.
   
       - Marvell proposes to add a Relay attribute exprnode_id and export this from the compiled artifact to identify the relay nodes which are fused into a particular subgraph
       - More broadly, source maps (e.g. mapping TIR to Relay to frontend operator) would help here.
   - Right now the RFC proposes to create a new GraphExecutorCodegen. It might not be necessary to do this if we could export the exprnode_id for Relay operators passed to BYOC. A suggestion is to create a Marvell-specific runtime::Module modeled after [CUDAModule](https://github.com/apache/tvm/blob/main/src/runtime/cuda/cuda_module.cc#L137) which contains several distinct pieces of generated code. The exprnode_ids could be kept separate from any binary instructions if encoded this way. This pattern is common amongst GPU-offloaded runtime::Module.
       - Additionally, note the [SaveToFile](https://github.com/apache/tvm/blob/main/src/runtime/cuda/cuda_module.cc#L70) override which is invoked when `Module.save()` is called from Python. This can allow you walk the runtime::Module tree from Python and collect the various exprnode_ids into a single e.g. JSON blob.
   - @jroesch to comment on rust CI failures
   - Marvell would like to contribute a simulator which can run in TVM CI to test their accelerator. We discussed either adding the sim to ci-cpu or a new ci-marvell, the method to do this, and limitations of TVM CI.
   - Marvell runs a patched version of the TVM CI internally. A primary reason why patching is needed is because many tests in the TVM CI require an internet connection to e.g. download models, but their CI is run in a sandbox. It would be particularly helpful to mark such tests e.g. via pytest.mark in order to make these easy to skip. We also discussed pre-populating the download_testdata cache and patching pytest.skip into download_testdata on their internal fork. cc @leandron @driazati @konturn for visibility and in case they have ideas here.
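
   As a follow-up to the StopFusion point above, here is a purely illustrative sketch (the is_mrvl_boundary predicate is hypothetical and not part of any existing Marvell or TVM code) of inserting the annotation programmatically via a Relay pass:

```
# Hedged sketch of the StopFusion suggestion: wrap selected call results in
# stop_fusion so the FuseOps pass keeps a partition boundary at those values.
# "is_mrvl_boundary" is a user-supplied predicate and purely illustrative.
from tvm import relay
from tvm.relay.expr_functor import ExprMutator
from tvm.relay.op.annotation import stop_fusion


class InsertStopFusion(ExprMutator):
    def __init__(self, is_mrvl_boundary):
        super().__init__()
        self.is_mrvl_boundary = is_mrvl_boundary

    def visit_call(self, call):
        new_call = super().visit_call(call)
        if self.is_mrvl_boundary(call):
            # FuseOps will not fuse across a stop_fusion-annotated value
            return stop_fusion(new_call)
        return new_call


def annotate_boundaries(mod, is_mrvl_boundary):
    mod["main"] = InsertStopFusion(is_mrvl_boundary).visit(mod["main"])
    return relay.transform.InferType()(mod)
```

   With the boundaries annotated this way, the stock FuseOps pass keeps a boundary at each annotated value, as noted above.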





[GitHub] [tvm-rfcs] ccjoechou commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1011255765


   Hello,
   
   How can we request a reviewer to review our RFC?
   Also, we have followed the RFC template and listed several unresolved questions, which need help from a reviewer and/or the TVM community to resolve.
   The first few issues are related to the TVM Jenkins build's rust/cargo failure. I.e., we would like to add two more build config flags (use_mrvl and use_mrvl_runtime) in the tvm/rust/ setups but need help to update the crates.io tvm-build package -- BTW, tvm-build seems to be owned by the OctoML GitHub. Also, there are other tvm/tests/scripts/task_rust.sh errors. Can a reviewer provide additional information or pointers explaining how we should debug the task_rust.sh run to find and resolve issues due to the new changes?
   
   Thanks,
   
   - Joe 





[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787275962



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * A for-accelerator Mrvl subgraph here means a set of connected, composite-fused Call nodes (let's call this sub-graph A)
+      in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in the Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions
+  in a model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included
+  in the upstream contribution
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run

Review comment:
       yes







[GitHub] [tvm-rfcs] ccjoechou commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1015958031


   @areusch - Thanks for replying.
   Please see my comments below regarding your questions.
   
   
     1.  Question: device planning, which I think maybe you're doing outside the typical TVM flow?
        *   If I understand your question correctly: Yes and no.
        *   We are using TVM relay flow to generate JSON meta files and IR sub-graphs of the network model
        *   We like to use TVM code-gen and runtime flow to generate binary to run inference for “llvm-part” of the network model
        *   But we also like to use our build-from-outside-TVM-flow, Marvell accelerator backend/code-gen component to generate binary for “Marvell-part” of the network model (to be run on Marvell accelerator)
            i.  Not right now, but in a future RFC, we can and would like to provide APIs and library files so that we can embed the Marvell backend/code-gen component into libtvm.so and within the typical TVM flow
   
        *   Not right now, but in a future RFC, when Marvell driver APIs and TVM-Marvell runtime & driver hookups are ready, we would like to use the typical TVM flow (with Marvell modifications) to run the “Marvell-part” computes of the network directly on the Marvell HW accelerator, together with the llvm-part of the computes
     2.  Question: executor, which I think you may have re-implemented here.
        *   I believe that we implemented specializations of the current executor code in order to generate Marvell JSON meta files.
        *   It is possible that others may have also implemented “parts of” similar specializations in the last 6 months – if this is the case (and we can use them), we would like to know how we can merge codebases
     3.  Question: to provide code links into your PoC if that would help me understand--I can do some targeted reading
        *   As listed in the RFC, our POC changes have been up-streamed to the TVM GitHub’s PR-9730.
   If you like, you should also be able to git clone from https://github.com/ccjoechou/tvm.git and check out & use the “byoc-mrvl” branch.
     4.  Question: it would also be great to spell out a plan for tests here--it seems like it might be possible to checkin your compiler/simulator into our CI, but could you be more explicit about your plans there?
        *   We have added infrastructure code and a test_mrvl suite to run the POC TVM-BYOC-Marvell flow
        *   Currently, there is a code-gen test, which can be run to use a pre-trained ssd-resnet50 model - please see tvm/tests/python/contrib/test_mrvl/test_mrvl_codegen.py and its test_ssd_resnet50_aot_json_codegen function
        *   Should also be able to run regular docker steps below to exercise the BYOC-Marvell flow to compile a ssd-resnet50 network to generate JSON meta files for Marvell accelerator:
   
   
     *   ./docker/bash.sh --name tvm_mrvl tlcpack/ci-cpu:v0.79 ./tests/scripts/task_config_build_cpu.sh
     *   ./docker/bash.sh --name tvm_mrvl tlcpack/ci-cpu:v0.79 ./tests/scripts/task_build.sh build -j10
     *   ./docker/bash.sh --name tvm_mrvl tlcpack/ci-cpu:v0.79 ./tests/scripts/task_ci_setup.sh
     *   ./docker/bash.sh --name tvm_mrvl tlcpack/ci-cpu:v0.79 ./tests/scripts/task_python_integration.sh
   
   
        *   For the last task_python_integration.sh suite, one can edit the file to skip the steps that run other test suites and focus on running only tests/python/contrib/test_mrvl:
   sudo pip3 install gluoncv
   run_pytest ctypes ${TVM_INTEGRATION_TESTSUITE_NAME}-contrib tests/python/contrib/test_mrvl
   
   Please also see my comments in-line below.
   Let me raise a difference here:
   
     *   The TVM partition’s sub-graph seems to represent a relay function, which can include multiple frontend operators captured by utilizing the relay merge-composite pattern
     *   The Marvell sub-graph is a connected graph of multiple relay merge-composite functions – I did not know how to include a Figure in the RFC file before (now I do). But if you look at the listed pre-RFC link, we did include figures at the end of the corresponding pre-RFC on the discuss forum – please check the end of the pre-RFC and its figures to see whether they help explain the definition of Marvell sub-graphs here. https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691
   
   Thanks again, and please let us know if you would like to discuss more.
   
   
     *   Joe
   
   
   
   
   
   From: Andrew Reusch ***@***.***>
   Sent: Tuesday, January 18, 2022 12:48 PM
   To: apache/tvm-rfcs ***@***.***>
   Cc: Joe Chou ***@***.***>; Mention ***@***.***>
   Subject: [EXT] Re: [apache/tvm-rfcs] [RFC][BYOC] Marvell ML/AI Accelerator Integration (PR #48)
   
   
   @areusch requested changes on this pull request.
   
   @ssdurako @ccjoechou apologies for the long delay! i think we missed this one since it was mailed during TVMCon and also just before we all took off for the holidays. I'll try to be a bit better about reviewing this.
   
   Overall I have some understanding of your approach with this RFC. I'd like to further discuss some of the rationale behind:
   
     *   device planning, which I think maybe you're doing outside the typical TVM flow?
     *   executor, which I think you may have re-implemented here.
   
   I'm a bit low on bandwidth to read your full PoC PR. would you mind clarifying the RFC as a starting point (or feel free to provide code links into your PoC if that would help me understand--I can do some targeted reading, I'm just fairly busy for a full read-through right now)
   
   it would also be great to spell out a plan for tests here--it seems like it might be possible to checkin your compiler/simulator into our CI, but could you be more explicit about your plans there?
   
    also cc @comaniac @mbs-octoml @Mousius @junrushao1994 for further comments on BYOC, device planning, and support for custom executors
   
   ________________________________
   
    In rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md (https://github.com/apache/tvm-rfcs/pull/48#discussion_r787106344):
   
   > +      conv2d + add + batch_norm + tuple.getitem(0) + relu
   
   +    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
   
   +      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
   
   +      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
   
   +
   
   +* Do code-gen step for each for-accelerator Mrvl subgraph:
   
   +    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
   
   +      file and a Constants-JSON file are produced for the Mrvl subgraph
   
   +
   
   +STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
   
   +
   
   +* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
   
   +  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
   
   +  in model binary file
   
   +
   
   +* Note: Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included
   
   what's the test plan for this RFC? Would it be possible to add the Marvell backend compiler and simulator to our ci images and run against it in CI?
    [ccjoechou writes: for this BYOC-Marvell RFC, the POC PR codebase only contains code to generate JSON meta files. We have up-streamed our test_mrvl test suite, but it only contains JSON codegen. In our next RFC, we will provide runtime & driver hookups. We are working on a Marvell backend package with the Marvell backend code-gen and the Marvell software simulator, which mimics a cycle-approximate Marvell HW accelerator. This package can become available later for external usage.
    Currently, we are having problems running TVM rust/cargo and can’t find useful documentation to debug the issues – plus, tvm-build is owned by OctoML (not GitHub TVM, right?)]
   
   ________________________________
   
    In rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md (https://github.com/apache/tvm-rfcs/pull/48#discussion_r787107277):
   
   > +STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
   
   +
   
   +* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
   
   +  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
   
   +  in model binary file
   
   +
   
   +* Note: Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included
   
   +  to upstream
   
   +
   
   +STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
   
   +
   
   +* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
   
   +  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
   
   +
   
   +* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
   
   +  focus only on float16 inference run
   
   just checking if this was the end of the sentence here
   [ccjoechou writes: yes]
   
   ________________________________
   
    In rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md (https://github.com/apache/tvm-rfcs/pull/48#discussion_r787114611):
   
   > +          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
   
   +      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
   
   +      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
   
   +          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
   
   +      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
   
   +      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
   
   +          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
   
   +      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
   
   +      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
   
   +}
   
   +
   
   +```
   
   +
   
   +* We can get to the following one Mrvl subgraph by applying the default strategy.
   
   +    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
   
   +      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for
   
   could you clarify this sentence?
   
   [ccjoechou writes: did not know how to include a Figure in the RFC file – but I did include figures at end of the corresponding pre-RFC on the discuss forum – please check the end of pre-RFC and its figure to see whether they can help explaining the definition of Marvell sub-graphs here. https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691]
   
   
   
   ________________________________
   
    In rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md (https://github.com/apache/tvm-rfcs/pull/48#discussion_r787115633):
   
   > +      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
   
   +      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
   
   +          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
   
   +      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
   
   +      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
   
   +}
   
   +
   
   +```
   
   +
   
   +* We can get to the following one Mrvl subgraph by applying the default strategy.
   
   +    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
   
   +      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for
   
   +
   
   +```
   
   +    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
   
   +      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
   
   above, the RFC discusses having exactly one Marvell and non-Marvell subcgraph, but here I see 8 different function calls. do you mean that there are two targets, and you partition the graph into 8 subgraphs, but each subgraph is assigned to one or the other target? (reading further, I can see this is not the case, but it would help with reader comprehension to clarify this example)
   
    [ccjoechou writes: We are talking about different definitions of “(sub-)graphs” here. In the TVM partition pass, TVM’s graph or sub-graph is a merge-composite IR function, which can contain a pre-defined pattern of original frontend operators. In the BYOC-Marvell RFC’s definition, a sub-graph is a connected graph of Marvell merge-composite functions. For instance, tvmgen_mrvl_main_4 (see below in the original email) is a TVM-partition sub-graph, which is a Marvell merge-composite function containing the frontend operators conv, add, batchnorm, tuple-get-item, and relu. But a Marvell sub-graph contains, in the given test case, several Marvell merge-composite functions.]
   
   ________________________________
   
    In rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md (https://github.com/apache/tvm-rfcs/pull/48#discussion_r787117464):
   
   > +      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
   
   +      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
   
   +      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
   
   +      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
   
   +      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
   
   +    }
   
   +```
   
   +
   
   +* In the above Mrvl subgraph, it is formed by "not-yet optimized Marvell (backend) layers". For example,
   
   +    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
   
   +    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
   
   +      original IR nodes into Marvell backend layers.
   
   +    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
   
   +      are composited into one Marvell layer: tvmgen_mrvl_main_1, conceptually speaking.
   
   +```
   
   +      # from original IR graphs
   
   this process looks rather similar to the device planning pass used in tvm.relay.build. are they the same? if not, could you motivate why you don't want to reuse that one?
    [ccjoechou writes: sorry, I am not sure what you meant by “device planning pass”. We have been following what others did in tvm/python/tvm/relay/op/contrib by utilizing relay passes (for example, ConvertLayout, MergeComposite, AnnotateTarget, etc.). Please note that in this RFC, we only want to generate JSON meta files and we are not ready to propose/up-stream our runtime & driver hookups yet.]
   
   ________________________________
   
    In rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md (https://github.com/apache/tvm-rfcs/pull/48#discussion_r787129775):
   
   > +
   
   +      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
   
   +      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
   
   +      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
   
   +      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
   
   +      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
   
   +
   
   +      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
   
   +          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
   
   +        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
   
   +            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
   
   +      }
   
   +```
   
   +
   
   +* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraphs and a non-Mrvl
   
   +  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute
   
   could you motivate the naming of en_id a bit? i recognize this is a common thing, but it might be nice to choose a slightly more specific name
    [ccjoechou writes: en_id stands for ExprNode ID. It is an extra field, which has been defined in the include/tvm/ir/expr.h file for the RelayExprNode or, more generally, the ExprNode class.]
   
   ________________________________
   
    In rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md (https://github.com/apache/tvm-rfcs/pull/48#discussion_r787130884):
   
   > +            mod = seq(mod)
   
   +        return mod
   
   +
   
   +    mod_new = tvm.IRModule(mod_mrvl.functions, mod_mrvl.type_definitions)
   
   +    mod_new["main"] = MrvlSubgraphToRevert(mrvl_layers_in_mrvl_subgraph, mod_mrvl).visit(mod_mrvl["main"])
   
   +    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
   
   +    mod_new = relay.transform.InferType()(mod_new)
   
   +    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
   
   +    mod_new = run_opt_pass(mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}))
   
   +    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
   
   +    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
   
   +    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
   
   +    return mod_new
   
   +```
   
   +
   
   +* Marvell-specific graph executor codegen, We have defined call backs and extension functions in the following files:
   
   could you motivate this further? it's hard to understand why you need to output your own JSON format without some explanation here.
    [ccjoechou writes: the above code block is not to output our own JSON format; instead, it is to “revert” a sub-graph, which went through the Marvell passes (e.g., ConvertLayout, MergeComposite, AnnotateTarget, etc.), back to its original, say, llvm IR graph. Hence, we defuse the ops (the opposite of MergeComposite), revert ConvertLayout, and so on. The motivation for reverting back to this llvm-part subgraph is to allow it to go through the TVM llvm flow to generate a runtime binary.]
   
   ________________________________
   
    In rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md (https://github.com/apache/tvm-rfcs/pull/48#discussion_r787131656):
   
   > +```
   
   +
   
   +* the need to link between pre-trained model and final Marvell backend layer - for instance, through tvm_custom
   
   +    * We did not include prototype code in PR-9730 but intend to provide our sample changes in another RFC and PR.
   
   +
   
   +
   
   +# Drawbacks
   
   +[drawbacks]: #drawbacks
   
   +
   
   +* We haven't identified any major *not* do items. Several other designs are by choices - that is we understand that
   
   +  there are benefits for doing or benefits for not-doing.
   
   +
   
   +# Rationale and alternatives
   
   +[rationale-and-alternatives]: #rationale-and-alternatives
   
   +
   
   +* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.
   
   it seems like there has been some impact to the GraphExecutor, and I think one point of confusion here is whether it was necessary to do that or whether you could have handled the additional runtime complexity inside a Marvell-specific runtime.Module. could you explain a bit further here?
   
   [ccjoechou writes: I do not see GraphExecutor term in above. Please provide an example or point us to a TVM file so we can understand your comment a bit more. Thanks.]
   
   ________________________________
   
    In rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md (https://github.com/apache/tvm-rfcs/pull/48#discussion_r787132217):
   
   > +  bypassing the tvm/Jenkinsfile's tests/scripts/task_rust.sh step. We will need help to re-enable the step.
   
   +
   
   +* We like to duplicate the Jenkins environment in order to run tvm/Jenkinsfile as is, but, we ran into many issues.
   
   +  Currently, we have a tvm-like Jenksinsfile environment to only run a subset of test suites using a modified
   
   +  Jenkinsfile.
   
   +
   
   +* We have identified a need to allow a call-back function to be registered when generating Mrvl-BYOC-specific
   
   +  Nodes-JSON file. We are trying to follow TVM Python/CPP-CB style as much as possible. But, since our callback
   
   +  function tvm/src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc::GetExternalJSON() function is using
   
   +  non-simple argument types, we need help from TVM community to provide suggestions/guidelines in order to make
   
   +  new CB code better to meet TVM community requirements here.
   
   +
   
   +* For one Mrvl-BYOC relay transformation pass, we have identified a need to inject a (global) expr node ID for the
   
   +  RelayExprNode class and its derived classes: Tuple and CallNode, so that during the transformation pass, we can
   
   +  uniquely identify each Tuple or CallNode object. Again, we need help from TVM community to provide
   
   +  suggestions/guidelines here in order to know whether this is one of the best ways to achieve the Mrvl-BYOC need.
   
   i think it would help to spell out why you guys need to be able to identify each expression here.
   
    [ccjoechou writes: yes, but not just us: a data-scientist customer who is using the TVM flow may like to know, for example, the linkages between the runtime performance numbers and the corresponding operators in their frontend model (e.g., each expression).]
   
   ________________________________
   
    In rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md (https://github.com/apache/tvm-rfcs/pull/48#discussion_r787132545):
   
   > +  RelayExprNode class and its derived classes: Tuple and CallNode, so that during the transformation pass, we can
   
   +  uniquely identify each Tuple or CallNode object. Again, we need help from TVM community to provide
   
   +  suggestions/guidelines here in order to know whether this is one of the best ways to achieve the Mrvl-BYOC need.
   
   +
   
   +* We also identified a need to maintain linkages between (operator-)information described in the original, given
   
   +  pre-trained network model and the code-gen JSON files so that the compiler backend will be able to report user-level
   
   +  (e.g., meaningful-to-user) messages regarding the given pre-trained network. For instance, in the
   
   +  tvm/python/tvm/relay/frontend/onnx.py and common.py files, we can see user-level information being captured using
   
   +  “tvm_custom” related code as in original onnx.py file for the given pre-trained network; but, in common.py, the code
   
   +  later drops the linkage, via attrs.pop(“tvm_custom”), and does not pass the linkage onto the initial relay IR graph.
   
   +  We have a draft solution to maintain linkages between the given pre-trained network model and its relay IR graph
   
   +  (using expr node ID and tvm custom ID, plus, a few utility functions), but would like to know whether the TVM
   
   +  community has any better or work-in-progress resolution.
   
   +
   
   +* When using TVM RPC code to exercise and run inference on a remote-hosted Mrvl ML/AI HW accelerator for the Mrvl
   
   +  subgraph, we ran into one minor issue and have made local TVM RPC enhancement so that, when a TVM RPC client sends
   
   could you explain the nature of the problem that requires the client to know the absolute path?
   
[ccjoechou writes: First, the TVM RPC server chooses a path under a random tmp directory for any uploaded file (which can be good for reducing possible security problems). But, in our use case, we would like the TVM RPC client to send a “runtime” command to the RPC server side to pre-process the just-uploaded file, before the file is consumed autonomously by the RPC server using a pre-defined script. We can't find a way, or a TVM example, that shows how this can be done.]
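
A minimal sketch of the intended client-side flow, assuming a server-side pre-processing hook has been registered under the global name "mrvl.preprocess_upload" (that name, its behavior, and the host/port are hypothetical; only rpc.connect(), upload(), and get_function() are standard TVM RPC calls):

```
from tvm import rpc

# connect to the remote RPC server hosting the Mrvl accelerator (placeholder address/port)
remote = rpc.connect("10.0.0.2", 9090)

# upload() stores the file under a server-chosen temp directory; the client only knows the basename
remote.upload("mnist_fashion_mrvl.bin")

# a global function registered in the server process could resolve and pre-process the upload;
# "mrvl.preprocess_upload" is a hypothetical name used only for illustration
preprocess = remote.get_function("mrvl.preprocess_upload")
remote_path = preprocess("mnist_fashion_mrvl.bin")  # e.g., returns the absolute path on the server
print("server stored the model binary at:", remote_path)
```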
   
   ________________________________
   
In rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md<https://github.com/apache/tvm-rfcs/pull/48#discussion_r787133542>:
   
   > +  pre-trained network model and the code-gen JSON files so that the compiler backend will be able to report user-level
   
   +  (e.g., meaningful-to-user) messages regarding the given pre-trained network. For instance, in the
   
   +  tvm/python/tvm/relay/frontend/onnx.py and common.py files, we can see user-level information being captured using
   
   +  “tvm_custom” related code as in original onnx.py file for the given pre-trained network; but, in common.py, the code
   
   +  later drops the linkage, via attrs.pop(“tvm_custom”), and does not pass the linkage onto the initial relay IR graph.
   
   +  We have a draft solution to maintain linkages between the given pre-trained network model and its relay IR graph
   
   +  (using expr node ID and tvm custom ID, plus, a few utility functions), but would like to know whether the TVM
   
   +  community has any better or work-in-progress resolution.
   
   +
   
   +* When using TVM RPC code to exercise and run inference on a remote-hosted Mrvl ML/AI HW accelerator for the Mrvl
   
   +  subgraph, we ran into one minor issue and have made local TVM RPC enhancement so that, when a TVM RPC client sends
   
   +  a file to the remote server, the TVM RPC client can know where the remote server saves the file on the remote machine.
   
+  Since this is not directly related to this Mrvl-BYOC PR, we will find time to contribute this enhancement back in another
   
   +  TVM PR soon.
   
   +
   
   +* In order for us to generate the constants-JSON file, we must “NOT” remove external params, which were stored in
   
   why is this? params passed in MetadataModule are meant for consumption only by the runtime.Module which defines them. it seems like perhaps you need to consume them at the executor level. could you explain that?
   
[ccjoechou writes: We are using relay to generate JSON meta files representing the given network model in a way our backend code can process directly (e.g., only the Marvell-part sub-graph(s)). If we had included 100% of our backend code in the TVM codebase, then we would not need to dump constants into a JSON meta file; but, because our backend code is built outside the typical TVM flow and can do other compile-time optimizations, including manipulating constants, we need the constants JSON.]
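
A minimal sketch of what such a constants dump could look like, assuming byoc_const_params is the name -> tvm.nd.NDArray dict returned by byoc_executor.get_params(); the actual JSON schema consumed by the Marvell backend compiler is not shown here:

```
import json

def dump_constants_json(byoc_const_params, filename):
    """Serialize BYOC constant params to a simple JSON file (illustrative schema only)."""
    consts = {}
    for name, ndarray in byoc_const_params.items():
        np_val = ndarray.numpy()  # tvm.nd.NDArray -> numpy array
        consts[name] = {
            "shape": list(np_val.shape),
            "dtype": str(np_val.dtype),
            "data": np_val.flatten().tolist(),
        }
    with open(filename, "w") as fout:
        json.dump(consts, fout)
    return filename
```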
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787281642



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
+  in model binary file
+
+* Note: Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included
+  to upstream
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the Nodes-JSON file
+  (nodes_json_filename) and Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod,
+        params,
+    )
+
+
+    # from infrastructure.py: pseudo code defined by the above aot_build_and_json_codegen() function
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such strategy (i.e., the default strategy), under which,
+  for this specific sample MNIST model, the entire network model is turned into one Mrvl subgraph and
+  no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get to the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph.
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet optimized" Marvell (backend) layers. For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer: tvmgen_mrvl_main_1, conceptually speaking.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for the Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer, tvmgen_mrvl_main_0,
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute
+  stored for the class CallNode and the class Tuple (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can have composited Marvell layer(s), back to its original IR nodes (e.g., to use the original tensor
+      layout and with no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using the original layouts with no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_mrvl = mod_mrvl_subgraph  # alias used below
+    mod_new = tvm.IRModule(mod_mrvl.functions, mod_mrvl.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(mrvl_layers_in_mrvl_subgraph, mod_mrvl).visit(mod_mrvl["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}))
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:
+    * Some common classes have been moved from the original src/relay/backend/graph_executor_codegen.cc file to the
+      new src/relay/backend/graph_executor_codegen.h file so that they can be shared by Marvell-specific functions
+      and derived classes defined in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+
+    * new definitions are listed below:
+```
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.h file
+    /*! \brief Node types */
+    enum GraphNodeType {
+      kGraphNop,
+      kGraphInputNode,
+      kGraphOpNode,
+      kGraphInputNodeExt,
+      kGraphOpNodeExt,
+    };
+
+    
+    class ExternalJsonWriterCB {
+     public:
+      template <class T>
+      void RegisterCB(T* const object,
+                      void (T::*const mf)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                          std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        using namespace std::placeholders;
+        callback_ = std::bind(mf, object, _1, _2, _3, _4);
+        hasCallback_ = true;
+      }
+      void RegisterCB(void (*const fun)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                        std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        callback_ = fun;
+        hasCallback_ = true;
+      }
+      void Exe(dmlc::JSONWriter* external_writer, Array<tvm::runtime::Module> mod,
+               std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+        ICHECK(hasCallback_) << "ERROR: no registered callback";
+        callback_(external_writer, mod, nodes, heads);
+      }
+      inline bool HasCallback() { return hasCallback_; }
+
+     private:
+      std::function<void(dmlc::JSONWriter*, Array<tvm::runtime::Module>, std::vector<GraphObjectPtr>,
+                         std::vector<GraphNodeRef>)>
+          callback_;
+      bool hasCallback_{false};
+    };
+
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.cc file
+    class GraphExecutorCodegen : public backend::MemoizedExprTranslator<std::vector<GraphNodeRef>> {
+     public:
+      GraphExecutorCodegen(runtime::Module* mod, const TargetMap& targets)
+          : mod_(mod), targets_(targets) {
+        // we need the following variable to be a static member of the class so we can access
+        //   its setting in the following static GetExternalJsonWriter() function; but this static
+        //   member can actually be used as a local Callback setting for "per" GraphExecutorCodegen
+        //   instantiation during each TVM build-codegen flow
+        external_json_writer_ = std::make_shared<ExternalJsonWriterCB>();
+        ICHECK(external_json_writer_);
+      }
+      static ExternalJsonWriterCB* GetExternalJsonWriter() { return external_json_writer_.get(); }
+      ....
+      LoweredOutput Codegen(IRModule mod, relay::Function func, String mod_name) {
+        ....
+
+        // if it has been registered for this GraphExecutorCodegen object, call the external JSON writer
+        if (external_json_writer_->HasCallback()) {
+          std::ostringstream external_os;
+          dmlc::JSONWriter external_writer(&external_os);
+          external_json_writer_->Exe(&external_writer, ret.external_mods, nodes_, heads_);
+          ret.external_graph_json = external_os.str();
+        }
+
+        return ret;
+      }
+    };
+
+    extern "C" ExternalJsonWriterCB* GetExternalJsonWriter() {
+      return GraphExecutorCodegen::GetExternalJsonWriter();
+    }
+
+    /////////////
+    // in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+    // Marvell-specific extensions
+    class GraphInputNodeMrvlExt : public GraphInputNode {
+        ...
+        GraphNodeType Type() const override { return kGraphInputNodeExt; }
+        void Save(dmlc::JSONWriter* writer) const override { /* extensions */ }
+    }
+
+    class GraphOpNodeMrvlExt : public GraphOpNode {
+        ...
+        GraphNodeType Type() const override { return kGraphOpNodeExt; }
+        void Load(dmlc::JSONReader* reader) override;
+        void LoadAttrs(dmlc::JSONReader* reader);
+        std::pair<std::string, GraphAttrs> GetLoadedGraphAttrs();
+    }
+
+    class MrvlExtJson {
+     public:
+      MrvlExtJson() {
+        ICHECK(!GetExternalJsonWriter()->HasCallback()) << "ERROR: has registered callback";
+        GetExternalJsonWriter()->RegisterCB(this, &MrvlExtJson::GetExternalJSON);
+      }
+      virtual ~MrvlExtJson() {}
+      void GetExternalJSON(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                           std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads);
+      void LoadExternalJsonAttrs(std::unordered_map<std::string, GraphAttrs>* external_attrs_map,
+                                 const Array<tvm::runtime::Module>& external_mods);
+    };
+```
+
+* The need to link between the pre-trained model and the final Marvell backend layers - for instance, through tvm_custom
+    * We did not include prototype code in PR-9730 but intend to provide our sample changes in another RFC and PR.
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* We haven't identified any major items that we should *not* do. Several other designs are deliberate choices - that
+  is, we understand there are benefits both to doing and to not doing them.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.
+
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+* We are following the existing TVM BYOC framework and example files.
+    * for example: to do IR compositions, to define our own IR passes, to mix implementations in Python/C++, etc.
+
+* We have extended graph_executor_codegen.cc and the JSON loader/saver in order to read and write out Marvell-specific
+  attributes
+
+* Currently, we haven't spent enough time to understand the tvm/rust/cargo requirements and steps. Therefore, we are
+  bypassing the tvm/Jenkinsfile's tests/scripts/task_rust.sh step. We will need help to re-enable the step.
+
+* We would like to duplicate the Jenkins environment in order to run tvm/Jenkinsfile as is, but we ran into many issues.
+  Currently, we have a tvm-like Jenkinsfile environment that only runs a subset of test suites using a modified
+  Jenkinsfile.
+
+* We have identified a need to allow a callback function to be registered when generating the Mrvl-BYOC-specific
+  Nodes-JSON file. We are trying to follow the TVM Python/CPP callback style as much as possible. But, since our
+  callback function tvm/src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc::GetExternalJSON() uses
+  non-simple argument types, we need help from the TVM community in the form of suggestions/guidelines so that the
+  new callback code better meets TVM community requirements here.
+
+* For one Mrvl-BYOC relay transformation pass, we have identified a need to inject a (global) expr node ID for the
+  RelayExprNode class and its derived classes: Tuple and CallNode, so that during the transformation pass, we can
+  uniquely identify each Tuple or CallNode object. Again, we need help from TVM community to provide
+  suggestions/guidelines here in order to know whether this is one of the best ways to achieve the Mrvl-BYOC need.
+
+* We also identified a need to maintain linkages between (operator-)information described in the original, given
+  pre-trained network model and the code-gen JSON files so that the compiler backend will be able to report user-level
+  (e.g., meaningful-to-user) messages regarding the given pre-trained network. For instance, in the
+  tvm/python/tvm/relay/frontend/onnx.py and common.py files, we can see user-level information being captured using
+  “tvm_custom” related code as in original onnx.py file for the given pre-trained network; but, in common.py, the code
+  later drops the linkage, via attrs.pop(“tvm_custom”), and does not pass the linkage onto the initial relay IR graph.
+  We have a draft solution to maintain linkages between the given pre-trained network model and its relay IR graph
+  (using expr node ID and tvm custom ID, plus, a few utility functions), but would like to know whether the TVM
+  community has any better or work-in-progress resolution.
+
+* When using TVM RPC code to exercise and run inference on a remote-hosted Mrvl ML/AI HW accelerator for the Mrvl
+  subgraph, we ran into one minor issue and have made local TVM RPC enhancement so that, when a TVM RPC client sends

Review comment:
       First, the TVM RPC server chooses a path under a random tmp directory for any uploaded file (which can be good for reducing possible security problems). But, in our use case, we would like the TVM RPC client to send a "runtime" command to the RPC server side to pre-process the just-uploaded file before the file is consumed autonomously by the RPC server using a pre-defined script. We can't find a way, or a TVM example, that shows how this can be done -- unless the client knows the uploaded file's
   path on the server.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787292940



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
+  in model binary file
+
+* Note: Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included
+  to upstream
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the Nodes-JSON file
+  (nodes_json_filename) and Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod,
+        params,
+    )
+
+
+    # from infrastructure.py: pseudo code defined by the above aot_build_and_json_codegen() function
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such strategy (i.e., the default strategy), under which,
+  for this specific sample MNIST model, the entire network model is turned into one Mrvl subgraph and
+  no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can get to the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph.
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet optimized" Marvell (backend) layers. For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer: tvmgen_mrvl_main_1, conceptually speaking.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
+
+* Because Marvell backend layers use the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for the Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer, tvmgen_mrvl_main_0,
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute
+  stored for the class CallNode and the class Tuple (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can have composited Marvell layer(s), back to its original IR nodes (e.g., to use the original tensor
+      layout and with no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using the original layouts with no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_mrvl = mod_mrvl_subgraph  # alias used below
+    mod_new = tvm.IRModule(mod_mrvl.functions, mod_mrvl.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(mrvl_layers_in_mrvl_subgraph, mod_mrvl).visit(mod_mrvl["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}))
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:
+    * Some common classes have been moved from the original src/relay/backend/graph_executor_codegen.cc file to the
+      new src/relay/backend/graph_executor_codegen.h file so that they can be shared by Marvell-specific functions
+      and derived classes defined in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+
+    * new definitions are listed below:
+```
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.h file
+    /*! \brief Node types */
+    enum GraphNodeType {
+      kGraphNop,
+      kGraphInputNode,
+      kGraphOpNode,
+      kGraphInputNodeExt,
+      kGraphOpNodeExt,
+    };
+
+    
+    class ExternalJsonWriterCB {
+     public:
+      template <class T>
+      void RegisterCB(T* const object,
+                      void (T::*const mf)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                          std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        using namespace std::placeholders;
+        callback_ = std::bind(mf, object, _1, _2, _3, _4);
+        hasCallback_ = true;
+      }
+      void RegisterCB(void (*const fun)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                        std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        callback_ = fun;
+        hasCallback_ = true;
+      }
+      void Exe(dmlc::JSONWriter* external_writer, Array<tvm::runtime::Module> mod,
+               std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+        ICHECK(hasCallback_) << "ERROR: no registered callback";
+        callback_(external_writer, mod, nodes, heads);
+      }
+      inline bool HasCallback() { return hasCallback_; }
+
+     private:
+      std::function<void(dmlc::JSONWriter*, Array<tvm::runtime::Module>, std::vector<GraphObjectPtr>,
+                         std::vector<GraphNodeRef>)>
+          callback_;
+      bool hasCallback_{false};
+    };
+
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.cc file
+    class GraphExecutorCodegen : public backend::MemoizedExprTranslator<std::vector<GraphNodeRef>> {
+     public:
+      GraphExecutorCodegen(runtime::Module* mod, const TargetMap& targets)
+          : mod_(mod), targets_(targets) {
+        // we need the following variable to be a static member of the class so we can access
+        //   its setting in the following static GetExternalJsonWriter() function; but this static
+        //   member can actually be used as a local Callback setting for "per" GraphExecutorCodegen
+        //   instantiation during each TVM build-codegen flow
+        external_json_writer_ = std::make_shared<ExternalJsonWriterCB>();
+        ICHECK(external_json_writer_);
+      }
+      static ExternalJsonWriterCB* GetExternalJsonWriter() { return external_json_writer_.get(); }
+      ....
+      LoweredOutput Codegen(IRModule mod, relay::Function func, String mod_name) {
+        ....
+
+        // if it has been registered for this GraphExecutorCodegen object, call the external JSON writer
+        if (external_json_writer_->HasCallback()) {
+          std::ostringstream external_os;
+          dmlc::JSONWriter external_writer(&external_os);
+          external_json_writer_->Exe(&external_writer, ret.external_mods, nodes_, heads_);
+          ret.external_graph_json = external_os.str();
+        }
+
+        return ret;
+      }
+    };
+
+    extern "C" ExternalJsonWriterCB* GetExternalJsonWriter() {
+      return GraphExecutorCodegen::GetExternalJsonWriter();
+    }
+
+    /////////////
+    // in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+    // Marvell-specific extensions
+    class GraphInputNodeMrvlExt : public GraphInputNode {
+        ...
+        GraphNodeType Type() const override { return kGraphInputNodeExt; }
+        void Save(dmlc::JSONWriter* writer) const override { /* extensions */ }
+    }
+
+    class GraphOpNodeMrvlExt : public GraphOpNode {
+        ...
+        GraphNodeType Type() const override { return kGraphOpNodeExt; }
+        void Load(dmlc::JSONReader* reader) override;
+        void LoadAttrs(dmlc::JSONReader* reader);
+        std::pair<std::string, GraphAttrs> GetLoadedGraphAttrs();
+    }
+
+    class MrvlExtJson {
+     public:
+      MrvlExtJson() {
+        ICHECK(!GetExternalJsonWriter()->HasCallback()) << "ERROR: has registered callback";
+        GetExternalJsonWriter()->RegisterCB(this, &MrvlExtJson::GetExternalJSON);
+      }
+      virtual ~MrvlExtJson() {}
+      void GetExternalJSON(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                           std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads);
+      void LoadExternalJsonAttrs(std::unordered_map<std::string, GraphAttrs>* external_attrs_map,
+                                 const Array<tvm::runtime::Module>& external_mods);
+    };
+```
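+
+On the Python side, this callback machinery is exercised indirectly. A condensed sketch (mirroring the flow shown
+  elsewhere in this RFC; get_external_graph_json() is part of the Marvell changes proposed here, not upstream TVM)
+  looks like:
+
+```
+# Condensed sketch of retrieving the Marvell-specific external graph JSON after relay.build().
+byoc_executor = relay.build(mod_mrvl_subgraph, target="llvm", mod_name=mod_name)
+byoc_const_params = byoc_executor.get_params()
+byoc_external_graph_json = byoc_executor.get_external_graph_json()
+```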
+
+* The need to link between the pre-trained model and the final Marvell backend layer - for instance, through
+  tvm_custom (a hypothetical sketch of such a linkage follows below).
+    * We did not include prototype code in PR-9730 but intend to provide our sample changes in another RFC and PR.
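+
+A hypothetical sketch (names and fields are illustrative only, not from PR-9730) of carrying such a linkage,
+  assuming a helper that maps an ONNX node name captured via tvm_custom to the proposed expr-node ID:
+
+```
+# Hypothetical sketch: map the proposed expr-node IDs back to original ONNX node info,
+# so backend messages can refer to user-visible layer names. exprnode_id_of is an assumed helper.
+def build_model_linkage(onnx_model, exprnode_id_of):
+    linkage = {}
+    for node in onnx_model.graph.node:
+        node_name = node.name or node.output[0]
+        if node_name in exprnode_id_of:
+            linkage[exprnode_id_of[node_name]] = {
+                "onnx_op_type": node.op_type,
+                "onnx_node_name": node_name,
+            }
+    return linkage
+```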
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* We haven't identified any major items that we should *not* do. Several other designs are deliberate choices -
+  that is, we understand there are benefits both to doing and to not doing them.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.

Review comment:
       The email version of this question linked to a different RFC segment (the light green section above), which led me to answer differently in my in-line reply to your email.
   Sorry about that - since I can now see the correct light green block above corresponding to your question, let me reply here again properly.
   
   Please see my reply to the previous question (e.g., we need Marvell-specific GraphOpNode and GraphInputNode in order to dump Marvell-specific attributes to the Nodes-JSON file).
   Also, because our Marvell compiler backend component is built outside the typical TVM flow, performs additional compile-time optimizations, and reads in the graph-executor-generated JSON metadata, we currently don't think using runtime.Module to generate the Marvell-specific JSON meta files for a network is the way to go here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#issuecomment-1027512160


   Hi @areusch:
   I have updated RFC #48 based on most of your feedback; the new RFC md file and 4 new figures have been uploaded.
   Please take a look again (in the document view mode I can see the new figures displayed inside the RFC).
   We are taking a couple of your TIR-related suggestions and will start reviewing RFC-0010 and the TVM TIR files.
   For our TVM PR-9730 POC code, I have renamed en_id to exprnode_id in our changes and will update that PR soon.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou closed pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou closed pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787282472



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; it can be used to read
+  in the Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final
+  instructions in a model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in the upstream contribution
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
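+
+As a rough, self-contained sketch (assuming TensorFlow/Keras and the tf2onnx package are available; the training
+  settings and the ONNX export path are illustrative only and not part of the BYOC flow itself), the model above
+  could be trained and exported to mnist_fashion.onnx roughly as follows:
+
+```
+# Illustrative only: train the Keras MNIST-fashion model and export it to ONNX.
+import onnx
+import tensorflow as tf
+import tf2onnx
+from tensorflow import keras
+
+(train_images, train_labels), _ = keras.datasets.fashion_mnist.load_data()
+
+model = keras.Sequential([
+    keras.Input(shape=(28, 28, 1)),
+    keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu"),
+    keras.layers.MaxPooling2D(pool_size=(2, 2)),
+    keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu"),
+    keras.layers.MaxPooling2D(pool_size=(2, 2)),
+    keras.layers.Dropout(0.3),
+    keras.layers.Flatten(),
+    keras.layers.Dense(256, activation="relu"),
+    keras.layers.Dense(10),
+])
+model.compile(
+    optimizer="adam",
+    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+    metrics=["accuracy"],
+)
+model.fit(train_images[..., None] / 255.0, train_labels, epochs=1)
+
+# Layout details (e.g., producing the NCHW [1, 1, 28, 28] input described above)
+# are left to the export/conversion step and are not shown here.
+spec = (tf.TensorSpec((1, 28, 28, 1), tf.float32, name="input"),)
+onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec)
+onnx.save(onnx_model, "mnist_fashion.onnx")
+```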
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files, as well as
+  in the pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate the TVM mod and params.
+  Then, they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the
+  Nodes-JSON file (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod,
+        params,
+    )
+
+
+    # from infrastructure.py: pseudo code defined by the above aot_build_and_json_codegen() function
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such alternative (i.e., the default strategy), which, for
+  this specific sample MNIST model, turns the entire network into one Mrvl subgraph and no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can arrive at the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create the mod_mrvl_subgraph and mod_non_mrvl_subgraph shown below
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet-optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer: tvmgen_mrvl_main_1, conceptually speaking.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
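+
+As a rough illustration of the kind of composite pattern behind "mrvl.conv2d_nhwc2nhwc" (the actual patterns
+  live in mrvl.mrvl_pattern_table() and may differ in structure and checks), a conv2d + add + relu composite
+  could be declared with the Relay dataflow pattern API like this:
+
+```
+# Illustrative sketch only; not the actual mrvl.py pattern table.
+from tvm.relay.dataflow_pattern import is_op, wildcard
+from tvm.relay.op.contrib.register import register_pattern_table
+
+
+def conv2d_add_relu_pattern():
+    conv = is_op("nn.conv2d")(wildcard(), wildcard())
+    bias = is_op("add")(conv, wildcard())
+    return is_op("nn.relu")(bias)
+
+
+@register_pattern_table("mrvl")
+def pattern_table():
+    return [("mrvl.conv2d_nhwc2nhwc", conv2d_add_relu_pattern())]
+```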
+
+* Because a Marvell backend layer uses the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer, tvmgen_mrvl_main_0,
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute
+  stored on the CallNode and Tuple classes (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can contain composited Marvell layer(s), back to its original IR nodes (e.g., using the original tensor
+      layout and no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using original layouts with no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """Revert the given Mrvl subgraph back to an IR module with original layouts and no compositions."""
+    mod_mrvl = mod_mrvl_subgraph  # alias used below
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl.functions, mod_mrvl.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(mrvl_layers_in_mrvl_subgraph, mod_mrvl).visit(mod_mrvl["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}))
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:
+    * Some common classes have been moved from the original src/relay/backend/graph_executor_codegen.cc file to the
+      new src/relay/backend/graph_executor_codegen.h file so that they can be shared by Marvell-specific functions
+      and derived classes defined in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+
+    * new definitions are listed below:
+```
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.h file
+    /*! \brief Node types */
+    enum GraphNodeType {
+      kGraphNop,
+      kGraphInputNode,
+      kGraphOpNode,
+      kGraphInputNodeExt,
+      kGraphOpNodeExt,
+    };
+
+    
+    class ExternalJsonWriterCB {
+     public:
+      template <class T>
+      void RegisterCB(T* const object,
+                      void (T::*const mf)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                          std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        using namespace std::placeholders;
+        callback_ = std::bind(mf, object, _1, _2, _3, _4);
+        hasCallback_ = true;
+      }
+      void RegisterCB(void (*const fun)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                        std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        callback_ = fun;
+        hasCallback_ = true;
+      }
+      void Exe(dmlc::JSONWriter* external_writer, Array<tvm::runtime::Module> mod,
+               std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+        ICHECK(hasCallback_) << "ERROR: no registered callback";
+        callback_(external_writer, mod, nodes, heads);
+      }
+      inline bool HasCallback() { return hasCallback_; }
+
+     private:
+      std::function<void(dmlc::JSONWriter*, Array<tvm::runtime::Module>, std::vector<GraphObjectPtr>,
+                         std::vector<GraphNodeRef>)>
+          callback_;
+      bool hasCallback_{false};
+    };
+
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.cc file
+    class GraphExecutorCodegen : public backend::MemoizedExprTranslator<std::vector<GraphNodeRef>> {
+     public:
+      GraphExecutorCodegen(runtime::Module* mod, const TargetMap& targets)
+          : mod_(mod), targets_(targets) {
+        // we need the following variable to be a static member of the class so we can access
+        //   its setting in the following static GetExternalJsonWriter() function; but this static
+        //   member can actually be used as a local Callback setting for "per" GraphExecutorCodegen
+        //   instantiation during each TVM build-codegen flow
+        external_json_writer_ = std::make_shared<ExternalJsonWriterCB>();
+        ICHECK(external_json_writer_);
+      }
+      static ExternalJsonWriterCB* GetExternalJsonWriter() { return external_json_writer_.get(); }
+      ....
+      LoweredOutput Codegen(IRModule mod, relay::Function func, String mod_name) {
+        ....
+
+        // if it has been registered for this GraphExecutorCodegen object, call the external JSON writer
+        if (external_json_writer_->HasCallback()) {
+          std::ostringstream external_os;
+          dmlc::JSONWriter external_writer(&external_os);
+          external_json_writer_->Exe(&external_writer, ret.external_mods, nodes_, heads_);
+          ret.external_graph_json = external_os.str();
+        }
+
+        return ret;
+      }
+    };
+
+    extern "C" ExternalJsonWriterCB* GetExternalJsonWriter() {
+      return GraphExecutorCodegen::GetExternalJsonWriter();
+    }
+
+    /////////////
+    // in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+    // Marvell-specific extensions
+    class GraphInputNodeMrvlExt : public GraphInputNode {
+        ...
+        GraphNodeType Type() const override { return kGraphInputNodeExt; }
+        void Save(dmlc::JSONWriter* writer) const override { /* extensions */ }
+    }
+
+    class GraphOpNodeMrvlExt : public GraphOpNode {
+        ...
+        GraphNodeType Type() const override { return kGraphOpNodeExt; }
+        void Load(dmlc::JSONReader* reader) override;
+        void LoadAttrs(dmlc::JSONReader* reader);
+        std::pair<std::string, GraphAttrs> GetLoadedGraphAttrs();
+    }
+
+    class MrvlExtJson {
+     public:
+      MrvlExtJson() {
+        ICHECK(!GetExternalJsonWriter()->HasCallback()) << "ERROR: has registered callback";
+        GetExternalJsonWriter()->RegisterCB(this, &MrvlExtJson::GetExternalJSON);
+      }
+      virtual ~MrvlExtJson() {}
+      void GetExternalJSON(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                           std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads);
+      void LoadExternalJsonAttrs(std::unordered_map<std::string, GraphAttrs>* external_attrs_map,
+                                 const Array<tvm::runtime::Module>& external_mods);
+    };
+```
+
+* The need to link between the pre-trained model and the final Marvell backend layer - for instance, through tvm_custom
+    * We did not include prototype code in PR-9730 but intend to provide our sample changes in another RFC and PR.
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* We haven't identified any major items that we should *not* do. Several other designs are deliberate choices -
+  that is, we understand there are benefits both to doing and to not doing them.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.
+
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+* We are following the existing TVM BYOC framework and example files.
+    * for example: to do IR compositions, to define our own IR passes, to mix implementations in Python/C++, etc.
+
+* We have extended graph_executor_codegen.cc and JSON loader/saver in order to read and write out Marvell specific
+  attributes
+
+* Currently, we haven't spent enough time to understand the tvm/rust/cargo requirements and steps. Therefore, we are
+  bypassing the tvm/Jenkinsfile's tests/scripts/task_rust.sh step. We will need help to re-enable that step.
+
+* We would like to duplicate the Jenkins environment in order to run tvm/Jenkinsfile as is, but we ran into many issues.
+  Currently, we have a TVM-like Jenkins environment that only runs a subset of test suites using a modified
+  Jenkinsfile.
+
+* We have identified a need to allow a callback function to be registered when generating the Mrvl-BYOC-specific
+  Nodes-JSON file. We are trying to follow the TVM Python/C++ callback style as much as possible. But, since our
+  callback function tvm/src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc::GetExternalJSON() uses
+  non-simple argument types, we need help from the TVM community to provide suggestions/guidelines in order to make
+  the new callback code better meet TVM community requirements here.
+
+* For one Mrvl-BYOC relay transformation pass, we have identified a need to inject a (global) expr node ID for the
+  RelayExprNode class and its derived classes, Tuple and CallNode, so that during the transformation pass we can
+  uniquely identify each Tuple or CallNode object. Again, we need help from the TVM community to provide
+  suggestions/guidelines here in order to know whether this is one of the best ways to achieve the Mrvl-BYOC need.
+
+* We also identified a need to maintain linkages between (operator-)information described in the original, given
+  pre-trained network model and the code-gen JSON files so that the compiler backend will be able to report user-level
+  (e.g., meaningful-to-user) messages regarding the given pre-trained network. For instance, in the
+  tvm/python/tvm/relay/frontend/onnx.py and common.py files, we can see user-level information being captured using
+  “tvm_custom” related code as in original onnx.py file for the given pre-trained network; but, in common.py, the code
+  later drops the linkage, via attrs.pop(“tvm_custom”), and does not pass the linkage onto the initial relay IR graph.
+  We have a draft solution to maintain linkages between the given pre-trained network model and its relay IR graph
+  (using expr node ID and tvm custom ID, plus, a few utility functions), but would like to know whether the TVM
+  community has any better or work-in-progress resolution.
+
+* When using TVM RPC code to exercise and run inference on a remote-hosted Mrvl ML/AI HW accelerator for the Mrvl
+  subgraph, we ran into one minor issue and have made a local TVM RPC enhancement so that, when a TVM RPC client sends
+  a file to the remote server, the TVM RPC client can know where the remote server saves the file on the remote machine.
+  Since this is not directly related to this Mrvl-BYOC PR, we will find time to contribute this enhancement back in
+  another TVM PR soon.
+
+* In order for us to generate the constants-JSON file, we must “NOT” remove external params, which were stored in

Review comment:
       We are using relay to generate JSON meta files representing the given network model in a way our (compiler-)backend code can process directly at compile time (e.g., only the Marvell-part sub-graph(s)). If we had included 100% of our backend code in the TVM codebase, then we would not need to dump constants into a JSON meta file; but because our backend code is built outside the typical TVM flow and can do other compile-time optimizations, including manipulating constants, we need the Constants-JSON file.
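   
   For illustration, a minimal sketch of what dumping byoc_const_params into a Constants-JSON file might look like
   (the actual dump_json_meta_data_files() in mrvl.py may use a different schema; the names here are illustrative):
   
   ```
   # Hypothetical sketch: serialize the constant params returned by relay.build(...).get_params().
   import json
   
   def dump_consts_json(byoc_const_params, path):
       consts = {}
       for name, ndarray in byoc_const_params.items():
           np_value = ndarray.numpy()  # tvm.nd.NDArray -> numpy array
           consts[name] = {
               "dtype": str(np_value.dtype),
               "shape": list(np_value.shape),
               "data": np_value.flatten().tolist(),
           }
       with open(path, "w") as f:
           json.dump(consts, f)
       return path
   ```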




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r787280651



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; it can be used to read
+  in the Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final
+  instructions in a model binary file
+
+* Note: the Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not
+  included in the upstream contribution
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files, as well as
+  in the pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate the TVM mod and params.
+  Then, they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the
+  Nodes-JSON file (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod,
+        params,
+    )
+
+
+    # from infrastructure.py: pseudo code defined by the above aot_build_and_json_codegen() function
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such alternative (i.e., the default strategy), which, for
+  this specific sample MNIST model, turns the entire network into one Mrvl subgraph and no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
+
+* We can arrive at the following single Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create the mod_mrvl_subgraph and mod_non_mrvl_subgraph shown below
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] */
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet-optimized Marvell (backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + nn.relu) in the original IR graph
+      are composited into one Marvell layer: tvmgen_mrvl_main_1, conceptually speaking.
+```
+      # from original IR graphs
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+
+
+      # from Mrvl subgraph
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      def @tvmgen_mrvl_main_3(%mrvl_3_i0: Tensor[(1, 14, 14, 64), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_3", Primitive=1) -> Tensor[(1, 14, 14, 32), float32] {
+
+        %13 = fn (%FunctionVar_0_0: Tensor[(1, 14, 14, 64), float32], PartitionedFromPattern="nn.conv2d_add_nn.relu_",
+            Composite="mrvl.conv2d_nhwc2nhwc") -> Tensor[(1, 14, 14, 32), float32] {
+          %11 = nn.conv2d(%FunctionVar_0_0, meta[relay.Constant][2] /* ty=Tensor[(32, 2, 2, 64), float32] */,
+              padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], data_layout="NHWC", kernel_layout="OHWI",
+              out_layout="NHWC", /* en_id=781 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          %12 = add(%11, meta[relay.Constant][3] /* ty=Tensor[(1, 1, 1, 32), float32] */,
+              /* en_id=789 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+          nn.relu(%12, /* en_id=793 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+        };
+
+        %13(%mrvl_3_i0, /* en_id=3343 */) /* ty=Tensor[(1, 14, 14, 32), float32] */
+      }
+```
+
+* Because a Marvell backend layer uses the NHWC format (for instance, for Conv2D, Pool2D, and Sum2D),
+    the relay.transform.ConvertLayout() pass is applied in the mrvl.py file. As a result, the NHWC format is used
+    for Marvell layers tvmgen_mrvl_main_1 to tvmgen_mrvl_main_4. In addition, the first layer, tvmgen_mrvl_main_0,
+    corresponds to a layout_transform() operation, which takes the original input tensor in src_layout="NCHW"
+    and converts it to a dst_layout="NHWC" tensor.
+
+```
+      relay.transform.ConvertLayout(
+          {"nn.conv2d": ["NHWC", "OHWI"], "nn.max_pool2d": ["NHWC"]}
+      ),
+
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+
+      def @tvmgen_mrvl_main_0(%mrvl_0_i0: Tensor[(1, 1, 28, 28), float32], Inline=1, Compiler="mrvl",
+          global_symbol="tvmgen_mrvl_main_0", Primitive=1) -> Tensor[(1, 28, 28, 1), float32] {
+        layout_transform(%mrvl_0_i0, src_layout="NCHW", dst_layout="NHWC",
+            /* en_id=3334 */) /* ty=Tensor[(1, 28, 28, 1), float32] */
+      }
+```
+
+* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
+  subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute
+  stored on the CallNode and Tuple classes (include/tvm/relay/expr.h).
+    * in mrvl.py: class MrvlIRGraphUtils.RestOfMrvlLayers(ExprMutator) is used to convert the non-Mrvl subgraph,
+      which can contain composited Marvell layer(s), back to its original IR nodes (e.g., using the original tensor
+      layout and no compositions)
+    * in mrvl.py: class MrvlIRGraphUtils.RestMrvlLayersGetInputs(ExprVisitor) is used to reconstruct the input
+      tensor for the non-Mrvl subgraph so that it becomes an IR graph recognized by the TVM LLVM build.
+    * in mrvl.py: the revert_mrvl_mod_to_orig() function is defined to convert the initial non-Mrvl subgraph back
+      to an IR subgraph using original layouts with no Marvell-specific compositions (e.g., similar to what was
+      given by the frontend)
+
+```
+def revert_mrvl_mod_to_orig(mod_mrvl_subgraph, mrvl_layers_in_mrvl_subgraph, debug=False):
+    """Revert the given Mrvl subgraph back to an IR module with original layouts and no compositions."""
+    mod_mrvl = mod_mrvl_subgraph  # alias used below
+
+    def run_opt_pass(mod, passes):
+        passes = passes if isinstance(passes, list) else [passes]
+        seq = tvm.transform.Sequential(passes)
+        with tvm.transform.PassContext(opt_level=3):
+            mod = seq(mod)
+        return mod
+
+    mod_new = tvm.IRModule(mod_mrvl.functions, mod_mrvl.type_definitions)
+    mod_new["main"] = MrvlSubgraphToRevert(mrvl_layers_in_mrvl_subgraph, mod_mrvl).visit(mod_mrvl["main"])
+    mod_new = relay.transform.RemoveUnusedFunctions()(mod_new)
+    mod_new = relay.transform.InferType()(mod_new)
+    mod_new = run_opt_pass(mod_new, relay.transform.DefuseOps())
+    mod_new = run_opt_pass(mod_new, relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OIHW"], "nn.max_pool2d": ["NCHW"]}))
+    mod_new = run_opt_pass(mod_new, relay.transform.SimplifyExpr())
+    mod_new = run_opt_pass(mod_new, relay.transform._ffi_api.DropNoopTranspose())
+    mod_new = run_opt_pass(mod_new, relay.transform.InferType())
+    return mod_new
+```
+
+* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:
+    * Some common classes have been moved from the original src/relay/backend/graph_executor_codegen.cc file to the
+      new src/relay/backend/graph_executor_codegen.h file so that they can be shared by Marvell-specific functions
+      and derived classes defined in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+
+    * new definitions are listed below:
+```
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.h file
+    /*! \brief Node types */
+    enum GraphNodeType {
+      kGraphNop,
+      kGraphInputNode,
+      kGraphOpNode,
+      kGraphInputNodeExt,
+      kGraphOpNodeExt,
+    };
+
+    
+    class ExternalJsonWriterCB {
+     public:
+      template <class T>
+      void RegisterCB(T* const object,
+                      void (T::*const mf)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                          std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        using namespace std::placeholders;
+        callback_ = std::bind(mf, object, _1, _2, _3, _4);
+        hasCallback_ = true;
+      }
+      void RegisterCB(void (*const fun)(dmlc::JSONWriter*, Array<tvm::runtime::Module>,
+                                        std::vector<GraphObjectPtr>, std::vector<GraphNodeRef>)) {
+        callback_ = fun;
+        hasCallback_ = true;
+      }
+      void Exe(dmlc::JSONWriter* external_writer, Array<tvm::runtime::Module> mod,
+               std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads) {
+        ICHECK(hasCallback_) << "ERROR: no registered callback";
+        callback_(external_writer, mod, nodes, heads);
+      }
+      inline bool HasCallback() { return hasCallback_; }
+
+     private:
+      std::function<void(dmlc::JSONWriter*, Array<tvm::runtime::Module>, std::vector<GraphObjectPtr>,
+                         std::vector<GraphNodeRef>)>
+          callback_;
+      bool hasCallback_{false};
+    };
+
+    /////////////
+    // in the new src/relay/backend/graph_executor_codegen.cc file
+    class GraphExecutorCodegen : public backend::MemoizedExprTranslator<std::vector<GraphNodeRef>> {
+     public:
+      GraphExecutorCodegen(runtime::Module* mod, const TargetMap& targets)
+          : mod_(mod), targets_(targets) {
+        // we need the following variable to be a static member of the class so we can access
+        //   its setting in the following static GetExternalJsonWriter() function; but this static
+        //   member can actually be used as a local Callback setting for "per" GraphExecutorCodegen
+        //   instantiation during each TVM build-codegen flow
+        external_json_writer_ = std::make_shared<ExternalJsonWriterCB>();
+        ICHECK(external_json_writer_);
+      }
+      static ExternalJsonWriterCB* GetExternalJsonWriter() { return external_json_writer_.get(); }
+      ....
+      LoweredOutput Codegen(IRModule mod, relay::Function func, String mod_name) {
+        ....
+
+        // if it has been registered for this GraphExecutorCodegen object, call the external JSON writer
+        if (external_json_writer_->HasCallback()) {
+          std::ostringstream external_os;
+          dmlc::JSONWriter external_writer(&external_os);
+          external_json_writer_->Exe(&external_writer, ret.external_mods, nodes_, heads_);
+          ret.external_graph_json = external_os.str();
+        }
+
+        return ret;
+      }
+    };
+
+    extern "C" ExternalJsonWriterCB* GetExternalJsonWriter() {
+      return GraphExecutorCodegen::GetExternalJsonWriter();
+    }
+
+    /////////////
+    // in the new src/relay/backend/contrib/mrvl/graph_executor_codegen.cc file
+    // Marvell-specific extensions
+    class GraphInputNodeMrvlExt : public GraphInputNode {
+        ...
+        GraphNodeType Type() const override { return kGraphInputNodeExt; }
+        void Save(dmlc::JSONWriter* writer) const override { /* extensions */ }
+    }
+
+    class GraphOpNodeMrvlExt : public GraphOpNode {
+        ...
+        GraphNodeType Type() const override { return kGraphOpNodeExt; }
+        void Load(dmlc::JSONReader* reader) override;
+        void LoadAttrs(dmlc::JSONReader* reader);
+        std::pair<std::string, GraphAttrs> GetLoadedGraphAttrs();
+    }
+
+    class MrvlExtJson {
+     public:
+      MrvlExtJson() {
+        ICHECK(!GetExternalJsonWriter()->HasCallback()) << "ERROR: has registered callback";
+        GetExternalJsonWriter()->RegisterCB(this, &MrvlExtJson::GetExternalJSON);
+      }
+      virtual ~MrvlExtJson() {}
+      void GetExternalJSON(dmlc::JSONWriter* writer, Array<tvm::runtime::Module> external_mods,
+                           std::vector<GraphObjectPtr> nodes, std::vector<GraphNodeRef> heads);
+      void LoadExternalJsonAttrs(std::unordered_map<std::string, GraphAttrs>* external_attrs_map,
+                                 const Array<tvm::runtime::Module>& external_mods);
+    };
+```
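+
+On the Python side, the callback is exercised indirectly: once the Marvell runtime module has registered its
+  MrvlExtJson callback, a regular relay.build() of the Mrvl subgraph produces the Marvell-specific external graph
+  JSON, as described earlier in this RFC. A condensed usage sketch (argument values are illustrative):
+
+```
+build_target = "llvm"
+mod_name = relay.backend.utils.mangle_module_name("")
+
+byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+
+# constants for the Constants-JSON file
+byoc_const_params = byoc_executor.get_params()
+# Nodes-JSON content produced by the registered GetExternalJSON() callback
+byoc_external_graph_json = byoc_executor.get_external_graph_json()
+```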
+
+* The need to link the pre-trained model to the final Marvell backend layers - for instance, through tvm_custom:
+    * We did not include prototype code in PR-9730 but intend to provide our sample changes in another RFC and PR.
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* We have not identified any major items that should *not* be done. Several other design decisions are deliberate
+  choices - that is, we understand there are benefits both to doing and to not doing them.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+* We follow the TVM BYOC framework to enable the BYOC Marvell flow without impacting any TVM core features.
+
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+* We are following the existing TVM BYOC framework and example files.
+    * For example: to do IR compositions, to define our own IR passes, to mix implementations in Python/C++, etc.
+
+* We have extended graph_executor_codegen.cc and the JSON loader/saver in order to read and write out
+  Marvell-specific attributes.
+
+* Currently, we have not spent enough time to understand the tvm/rust/cargo requirements and steps. Therefore, we
+  are bypassing the tvm/Jenkinsfile's tests/scripts/task_rust.sh step and will need help to re-enable it.
+
+* We would like to duplicate the Jenkins environment in order to run tvm/Jenkinsfile as is, but we ran into many
+  issues. Currently, we have a TVM-like Jenkinsfile environment that only runs a subset of test suites using a
+  modified Jenkinsfile.
+
+* We have identified a need to allow a callback function to be registered when generating the Mrvl-BYOC-specific
+  Nodes-JSON file. We are trying to follow the TVM Python/C++ callback style as much as possible. However, since
+  our callback function tvm/src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc::GetExternalJSON() uses
+  non-simple argument types, we need suggestions/guidelines from the TVM community in order to make the new
+  callback code meet TVM requirements.
+
+* For one Mrvl-BYOC relay transformation pass, we have identified a need to inject a (global) expr node ID into the
+  RelayExprNode class and its derived classes, Tuple and CallNode, so that during the transformation pass we can
+  uniquely identify each Tuple or CallNode object. Again, we need suggestions/guidelines from the TVM community
+  in order to know whether this is the best way to achieve the Mrvl-BYOC need.

Review comment:
        Yes, and not just us: a data-scientist customer using the TVM flow may want to know, for example, the linkages between the runtime performance numbers (provided by the driver and/or hardware) and the corresponding operators of their frontend model (i.e., the expressions the customer knows).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r797193262



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+
+* We can get to the following one Mrvl subgraph by applying the default strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;

Review comment:
        We have not spent time on the TIR flow and passes yet - we will.
    One quick question: can the TIR buffer and its data layout drive how inputs/outputs of Marvell sub-graphs and non-Marvell LLVM sub-graphs are communicated during inference runtime?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r797201350



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+
+* We also identified a need to maintain linkages between the (operator-)information described in the original,
+  given pre-trained network model and the code-gen JSON files, so that the compiler backend will be able to report
+  user-level (i.e., meaningful-to-user) messages regarding the given pre-trained network. For instance, in the
+  tvm/python/tvm/relay/frontend/onnx.py and common.py files, we can see user-level information being captured using
+  “tvm_custom”-related code, as in the original onnx.py file, for the given pre-trained network; but, in common.py,
+  the code later drops the linkage, via attrs.pop(“tvm_custom”), and does not pass the linkage onto the initial
+  relay IR graph. We have a draft solution to maintain linkages between the given pre-trained network model and its
+  relay IR graph (using an expr node ID and a tvm custom ID, plus a few utility functions), but would like to know
+  whether the TVM community has any better or work-in-progress resolution.
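+
+  As a purely illustrative sketch (ours, not the draft solution itself), the frontend could record the tvm_custom
+  information in a side table keyed by an expression ID instead of discarding it when the attribute is popped;
+  the field names below follow what onnx.py currently stores:
+
+```
+tvm_custom_map = {}
+
+def record_tvm_custom(expr_id, attrs):
+    """Keep the frontend-provided tvm_custom info (e.g., ONNX node name) instead of dropping it."""
+    custom = attrs.pop("tvm_custom", None)  # common.py currently pops and discards this value
+    if custom is not None:
+        # e.g., {"name": "<onnx node name>", "num_outputs": 1}
+        tvm_custom_map[expr_id] = dict(custom)
+    return attrs
+```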
+
+* When using TVM RPC code to exercise and run inference on a remotely hosted Mrvl ML/AI HW accelerator for the Mrvl
+  subgraph, we ran into one minor issue and have made a local TVM RPC enhancement so that, when a TVM RPC client
+  sends a file to the remote server, the client can learn where the remote server saved the file on the remote
+  machine. Since this is not directly related to this Mrvl-BYOC PR, we will find time to contribute this
+  enhancement back in another TVM PR soon.
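+
+  A minimal sketch of the client-side flow we have in mind (the returned remote path is the proposed enhancement
+  and is not part of today's upstream RPC API; the server address and file names are illustrative):
+
+```
+from tvm import rpc
+
+remote = rpc.connect("10.0.0.2", 9090)        # remote host driving the Mrvl accelerator
+remote.upload("mnist_fashion_mrvl.bin")       # existing API: copy the model binary to the server
+
+# proposed enhancement: let the client query where the server stored the uploaded file,
+# so that path can be handed to the on-target driver/runtime
+# remote_path = remote.get_uploaded_path("mnist_fashion_mrvl.bin")
+```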
+
+* In order for us to generate the constants-JSON file, we must “NOT” remove external params, which were stored in

Review comment:
       We will review RFC-#10 to find out. Thanks.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] ccjoechou commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

Posted by GitBox <gi...@apache.org>.
ccjoechou commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r797201727



##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: [apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: [BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what Marvell ML/AI inference accelerator does the best, a given pre-trained network model
+will be applied to a TVM-Mrvl-BYOC AOT compilation and code-gen flow as illustrated in steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the OCTEON SDK; and it can be used to read
+  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
+  in model binary file
+
+* Note: Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included
+  to upstream
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: {fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the train_images dataset below and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run the BYOC Marvell flow by giving any
+  image orig_test_images[i] of the test dataset to obtain its inferred fashion label and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
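+
+A hedged training sketch, building on the model object from the Sequential sketch above, could look like the
+  following; exporting the trained model to ONNX (for example with the tf2onnx package) is assumed to produce
+  the mnist_fashion.onnx file used in the rest of this section:
+
+```
+import numpy as np
+from tensorflow import keras
+
+# scale pixels to [0, 1] and add the channel dimension expected by Conv2D
+# (train_images and train_labels come from the fashion_mnist load_data() call above)
+x_train = train_images.astype("float32")[..., np.newaxis] / 255.0
+
+model.compile(
+    optimizer="adam",
+    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+    metrics=["accuracy"],
+)
+model.fit(x_train, train_labels, batch_size=64, epochs=1)
+
+# later, an inference result for a test image orig_test_images[i] can be checked
+# against its golden label golden_output_labels[i]
+```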
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and infrastructure.py files as well as
+  in the pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() to generate the TVM mod and params.
+  Then, they are used as function arguments to call the aot_build_and_json_codegen() API in order to generate the
+  Nodes-JSON file (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions (as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in the Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return the Marvell-specific Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        model_name="mnist_fashion",
+        working_dir="mnist",
+        mod=mod,
+        params=params,
+    )
+
+
+    # from infrastructure.py: pseudo code of what the above aot_build_and_json_codegen() function does
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(mod_mrvl_subgraph, target=build_target, mod_name=mod_name)
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
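+
+Since the two generated files are plain JSON, they can be inspected directly once produced; the trivial sketch
+  below uses only the Python standard library and the nodes_json_filename returned above:
+
+```
+import json
+
+# peek at the generated Nodes-JSON meta-data
+with open(nodes_json_filename) as json_file:
+    nodes_meta_data = json.load(json_file)
+print(json.dumps(nodes_meta_data, indent=2)[:1000])
+```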
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such alternative (i.e., the default strategy), in which,
+  for this specific sample MNIST model, the entire network is turned into one Mrvl subgraph and
+  no non-Mrvl subgraph.
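+
+For reference, a partition_for_mrvl()-style entry point typically builds on the standard TVM BYOC partitioning
+  passes; the generic sketch below illustrates that common pipeline (using "mrvl" as the registered pattern-table
+  and compiler name is an assumption for illustration) and is not a verbatim copy of the Marvell implementation:
+
+```
+from tvm import relay, transform
+from tvm.relay.op.contrib import get_pattern_table
+
+# generic BYOC partitioning pipeline: merge composites, annotate, merge regions, partition
+seq = transform.Sequential(
+    [
+        relay.transform.MergeComposite(get_pattern_table("mrvl")),
+        relay.transform.AnnotateTarget("mrvl"),
+        relay.transform.MergeCompilerRegions(),
+        relay.transform.PartitionGraph(),
+    ]
+)
+with transform.PassContext(opt_level=3):
+    partitioned_mod = seq(mod)
+```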
+
+* Below is the original IR graph - i.e., right after the from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] */,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 */) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] */, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 */) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+}
+
+```
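+
+The IR text above is, for instance, what printing the imported Relay module would produce (a hedged one-liner;
+  exact annotations such as en_id may differ between builds):
+
+```
+# print the Relay IR of the imported model, omitting the meta-data section
+print(mod.astext(show_meta_data=False))
+```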
+
+* We can get to the following one Mrvl subgraph by applying the default strategy.
+    * In the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;

Review comment:
       Saw your other feedback regarding RFC-#10 and we will review the RFC. Thanks.



