Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/12/16 12:31:12 UTC

[GitHub] [tvm-rfcs] zhuwenxi opened a new pull request #47: RFC to integrate LIBXSMM with TVM.

zhuwenxi opened a new pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47


   pre-RFC: https://discuss.tvm.apache.org/t/rfc-top-byoc-intel-libxsmm-integration/11688


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] zhuwenxi commented on a change in pull request #47: RFC to integrate LIBXSMM with TVM.

Posted by GitBox <gi...@apache.org>.
zhuwenxi commented on a change in pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47#discussion_r771072867



##########
File path: rfcs/0046-Intel-LIBXSMM-integration.md
##########
@@ -0,0 +1,97 @@
+# Summary
+This RFC introduces the plan to integrate LIBXSMM into TVM. LIBXSMM leverages a JIT code generator to produce highly efficient kernels targeting x86 architectures.
+
+For details of LIBXSMM, please refer to:
+* [LIBXSMM User Manual](https://libxsmm.readthedocs.io/en/latest/)
+* [LIBXSMM github repo](https://github.com/hfp/libxsmm)
+
+# Motivation
+TVM has shown satisfactory performance on MLP models on CPU. However, there are still some defects in the assembly code generated by LLVM which keep AutoTVM/AutoScheduler from reaching optimal GEMM performance.
+
+LIBXSMM is an open-source library developed by Intel Labs for accelerating small matrix multiplication. It leverages a JIT code generator to produce highly efficient GEMM kernels for x86 CPUs, which can come very close to the hardware roofline. According to our evaluation, on “small” GEMM (cube_root(m * n * k) <= 256), LIBXSMM shows superior performance over the well-known BLAS library Intel MKL.
+
+Moreover, given that LIBXSMM can generate very efficient GEMM kernel implementations, it is also an ideal substitute for the inner kernel of normal-size GEMM. According to our experiments, the AutoTVM templates we wrote with LIBXSMM generating the register blocks achieve much higher performance than MKL and the existing TOPI implementation.
+
+# Guide-level explanation
+This proposal aims to integrate LIBXSMM into TVM to accelerate small GEMM and to serve as the inner kernel for accelerating normal-size GEMM.
+
+We will integrate LIBXSMM with TVM in the following 3 components:
+1. Add an extern call “tvm.contrib.libxsmm.gemm” in the “src/runtime/contrib” directory, and a corresponding Python interface in the "python/tvm/contrib/" directory, so users can call it just as they call CBLAS;
+2. Use BYOC to accelerate small GEMM (cube_root(m * n * k) <= 256) and its epilogue fusion variants (bias/relu/sigmoid/bias_relu/bias_sigmoid);
+3. Add the AutoTVM template we wrote with LIBXSMM as the inner kernel into TOPI, as a GEMM implementation candidate.
+
+# Reference-level explanation
+1. Users can call LIBXSMM just as they call CBLAS, through the extern call API.
+```
+  def matmul(lhs, rhs, transa=False, transb=False, alpha=1.0, beta=0.0, lda=-1, ldb=-1, ldc=-1, **kwargs):
+    n = lhs.shape[1] if transa else lhs.shape[0]
+    m = rhs.shape[0] if transb else rhs.shape[1]
+    return te.extern(
+      (n, m),
+      [lhs, rhs],
+      lambda ins, outs: tvm.tir.call_packed(
+        "tvm.contrib.libxsmm.matmul", ins[0], ins[1], outs[0], transa, transb, alpha, beta, lda, ldb, ldc),
+      name="C",
+      **kwargs,
+    )
+```
+2. BYOC allows for graph partitioning and using LIBXSMM for code generation.
+	* API to obtain the partitioned function:
+```
+	from tvm.relay.op.contrib import libxsmm
+
+	# API to call LIBXSMM partitioning
+    libxsmm_module = libxsmm.partition_for_cmsisnn(module) 

Review comment:
       Fixed.







[GitHub] [tvm-rfcs] zhuwenxi commented on pull request #47: [RFC] Integrate LIBXSMM with TVM.

Posted by GitBox <gi...@apache.org>.
zhuwenxi commented on pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47#issuecomment-1002555412


   @comaniac OK, I'll update the RFC information soon.





[GitHub] [tvm-rfcs] comaniac commented on a change in pull request #47: RFC to integrate LIBXSMM with TVM.

Posted by GitBox <gi...@apache.org>.
comaniac commented on a change in pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47#discussion_r770866633



##########
File path: rfcs/0046-Intel-LIBXSMM-integration.md
##########
@@ -0,0 +1,97 @@
+# Summary
+This RFC introduces the plan to integrate LIBXSMM into TVM. LIBXSMM leverages a JIT code generator to produce highly efficient kernels targeting x86 architectures.
+
+For details of LIBXSMM, please refer to:
+* [LIBXSMM User Manual](https://libxsmm.readthedocs.io/en/latest/)
+* [LIBXSMM github repo](https://github.com/hfp/libxsmm)
+
+# Motivation
+TVM has shown satisfactory performance on MLP models on CPU. However, there are still some defects in the assembly code generated by LLVM which keep AutoTVM/AutoScheduler from reaching optimal GEMM performance.
+
+LIBXSMM is an open-source library developed by Intel Labs for accelerating small matrix multiplication. It leverages a JIT code generator to produce highly efficient GEMM kernels for x86 CPUs, which can come very close to the hardware roofline. According to our evaluation, on “small” GEMM (cube_root(m * n * k) <= 256), LIBXSMM shows superior performance over the well-known BLAS library Intel MKL.
+
+Moreover, given that LIBXSMM can generate very efficient GEMM kernel implementations, it is also an ideal substitute for the inner kernel of normal-size GEMM. According to our experiments, the AutoTVM templates we wrote with LIBXSMM generating the register blocks achieve much higher performance than MKL and the existing TOPI implementation.
+
+# Guide-level explanation
+This proposal aims to integrate LIBXSMM into TVM to accelerate small GEMM and to serve as the inner kernel for accelerating normal-size GEMM.
+
+We will integrate LIBXSMM with TVM in the following 3 components:
+1. Add an extern call “tvm.contrib.libxsmm.gemm” in the “src/runtime/contrib” directory, and a corresponding Python interface in the "python/tvm/contrib/" directory, so users can call it just as they call CBLAS;
+2. Use BYOC to accelerate small GEMM (cube_root(m * n * k) <= 256) and its epilogue fusion variants (bias/relu/sigmoid/bias_relu/bias_sigmoid);
+3. Add the AutoTVM template we wrote with LIBXSMM as the inner kernel into TOPI, as a GEMM implementation candidate.
+
+# Reference-level explanation
+1. Users can call LIBXSMM just as they call CBLAS, through the extern call API.
+```
+  def matmul(lhs, rhs, transa=False, transb=False, alpha=1.0, beta=0.0, lda=-1, ldb=-1, ldc=-1, **kwargs):
+    n = lhs.shape[1] if transa else lhs.shape[0]
+    m = rhs.shape[0] if transb else rhs.shape[1]
+    return te.extern(
+      (n, m),
+      [lhs, rhs],
+      lambda ins, outs: tvm.tir.call_packed(
+        "tvm.contrib.libxsmm.matmul", ins[0], ins[1], outs[0], transa, transb, alpha, beta, lda, ldb, ldc),
+      name="C",
+      **kwargs,
+    )
+```
+2. BYOC allows for graph partitioning and using LIBXSMM for code generation.
+	* API to obtain the partitioned function:
+```
+	from tvm.relay.op.contrib import libxsmm
+
+	# API to call LIBXSMM partitioning
+    libxsmm_module = libxsmm.partition_for_cmsisnn(module) 

Review comment:
       typo?

##########
File path: rfcs/0046-Intel-LIBXSMM-integration.md
##########
@@ -0,0 +1,97 @@
+# Summary
+This RFC introduces the plan of integrating LIBXSMM into TVM. LIBXSMM leverages JIT code generator to produce high efficient kernels targeting x86 architectures. 
+
+For details of LIBXSMM, please refer to:
+* [LIBXSMM User Manual](https://libxsmm.readthedocs.io/en/latest/)
+* [LIBXSMM github repo](https://github.com/hfp/libxsmm)
+
+# Motivation
+TVM has shown satisfactory performance on MLP models with CPU. However there are still some defects in the assembly code generated by LLVM which block AutoTVM/AutoScheduler from achieving optimal on GEMM.
+
+LIBXSMM is a open source library developed by Intel Lab for accelerating small matrix multiplication. It leverages the JIT code generator to generate high efficient GEMM kernels for x86 CPU, which could be very close to hardware rootline. According to our evaluation, in “small” GEMM (cube_root(m * n * k) <= 256) , LIBXSMM shows a superior performance over the well-known BLAS library Intel MKL. 
+
+By the way, given that LIBXSMM can generate quite efficient GEMM kernel implementation, it is also an ideal substitution for inner-kernel of normal size GEMM. According our experiments, the AutoTVM templates we wrote with LIBXSMM as register-block generation, has a much higher performance comparing to MKL and existing TOPI implementation.
+
+# Guide-level explanation
+This proposal aims to integrate LIBXSMM into TVM to accelerate small GEMM and serve as inner-kernel to accelerate normal size GEMM.
+
+We will integrate LIBXSMM with TVM in following 3 components:
+1. Add extern call “tvm.contrib.libxsmm.gemm” in “src/runtime/contrib” directory, and corresponding python interface in "python/tvm/contrib/" directory, so users can call them just as CBLAS; 
+2. Use BYOC to accelerate small GEMM (cube_root(m * n * k ) <= 256) and its epilogue fusion variations (bias/relu/sigmoid/bias_relu/bias_sigmoid);
+3. AutoTVM template we wrote with LIBXSMM as inner kernel into TOPI, as a GEMM implementation candidate.
+
+# Reference-level explanation
+1. Users can call libxsmm as CBLAS through extern call API.

Review comment:
       It would be more useful if this RFC could also cover the target system and Relay op strategy so that end users can make use of libxsmm much more easily. Specifically, when users specify `llvm -libs=libxsmm`, the Relay op strategy automatically lowers the corresponding GEMM ops to libxsmm, just like `llvm -libs=cblas`.







[GitHub] [tvm-rfcs] zhuwenxi commented on pull request #47: RFC to integrate LIBXSMM with TVM.

Posted by GitBox <gi...@apache.org>.
zhuwenxi commented on pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47#issuecomment-996412022


   > Thanks for the RFC and overall LGTM. One suggestion I have is that it would be great if you could provide an upstream plan that briefly explains how you would send PRs. To facilitate the review process, it is encouraged to break down your implementation into a series of small PRs. Here is an example of a PR series:
   > 
   > 1. Add libxsmm to the TVM CI.
   > 2. Add libxsmm to TOPI.
   > 3. Add libxsmm to Relay op strategy.
   > 4. Add libxsmm to BYOC.
   
   Upstream plan added at the end of the RFC.

   By the way, I'm not quite sure what "TVM CI" refers to. If it means unit tests, they will be included in their related PRs when I upstream my code.





[GitHub] [tvm-rfcs] zhuwenxi commented on a change in pull request #47: RFC to integrate LIBXSMM with TVM.

Posted by GitBox <gi...@apache.org>.
zhuwenxi commented on a change in pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47#discussion_r771079138



##########
File path: rfcs/0046-Intel-LIBXSMM-integration.md
##########
@@ -0,0 +1,97 @@
+# Summary
+This RFC introduces the plan to integrate LIBXSMM into TVM. LIBXSMM leverages a JIT code generator to produce highly efficient kernels targeting x86 architectures.
+
+For details of LIBXSMM, please refer to:
+* [LIBXSMM User Manual](https://libxsmm.readthedocs.io/en/latest/)
+* [LIBXSMM github repo](https://github.com/hfp/libxsmm)
+
+# Motivation
+TVM has shown satisfactory performance on MLP models on CPU. However, there are still some defects in the assembly code generated by LLVM which keep AutoTVM/AutoScheduler from reaching optimal GEMM performance.
+
+LIBXSMM is an open-source library developed by Intel Labs for accelerating small matrix multiplication. It leverages a JIT code generator to produce highly efficient GEMM kernels for x86 CPUs, which can come very close to the hardware roofline. According to our evaluation, on “small” GEMM (cube_root(m * n * k) <= 256), LIBXSMM shows superior performance over the well-known BLAS library Intel MKL.
+
+Moreover, given that LIBXSMM can generate very efficient GEMM kernel implementations, it is also an ideal substitute for the inner kernel of normal-size GEMM. According to our experiments, the AutoTVM templates we wrote with LIBXSMM generating the register blocks achieve much higher performance than MKL and the existing TOPI implementation.
+
+# Guide-level explanation
+This proposal aims to integrate LIBXSMM into TVM to accelerate small GEMM and to serve as the inner kernel for accelerating normal-size GEMM.
+
+We will integrate LIBXSMM with TVM in the following 3 components:
+1. Add an extern call “tvm.contrib.libxsmm.gemm” in the “src/runtime/contrib” directory, and a corresponding Python interface in the "python/tvm/contrib/" directory, so users can call it just as they call CBLAS;
+2. Use BYOC to accelerate small GEMM (cube_root(m * n * k) <= 256) and its epilogue fusion variants (bias/relu/sigmoid/bias_relu/bias_sigmoid);
+3. Add the AutoTVM template we wrote with LIBXSMM as the inner kernel into TOPI, as a GEMM implementation candidate.
+
+# Reference-level explanation
+1. Users can call LIBXSMM just as they call CBLAS, through the extern call API.

Review comment:
       1. Relay op strategy support is already in our proposal; it's natural to add the Relay op strategy once the TOPI integration is done. OK, I'll make this clear in the RFC;
   2. It does make sense to cover the target system; I'll add that part.







[GitHub] [tvm-rfcs] zhuwenxi commented on pull request #47: [RFC] Integrate LIBXSMM with TVM.

Posted by GitBox <gi...@apache.org>.
zhuwenxi commented on pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47#issuecomment-1012045073


   @comaniac I've recently started implementing the first PR, "Add libxsmm to TVM CI". I wonder if there is any CI-related PR I can refer to?





[GitHub] [tvm-rfcs] zhuwenxi commented on pull request #47: [RFC] Integrate LIBXSMM with TVM.

Posted by GitBox <gi...@apache.org>.
zhuwenxi commented on pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47#issuecomment-1012703880


   > > @comaniac I've recently started implementing the first PR, "Add libxsmm to TVM CI". I wonder if there is any CI-related PR I can refer to?
   > 
   > You could refer to a PR like [apache/tvm#9881](https://github.com/apache/tvm/pull/9881) or something similar.
   
   Thank you!





[GitHub] [tvm-rfcs] zhuwenxi commented on a change in pull request #47: [RFC] Integrate LIBXSMM with TVM.

Posted by GitBox <gi...@apache.org>.
zhuwenxi commented on a change in pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47#discussion_r774513767



##########
File path: rfcs/0046-Intel-LIBXSMM-integration.md
##########
@@ -0,0 +1,108 @@
+# Summary
+This RFC introduces the plan to integrate LIBXSMM into TVM. LIBXSMM leverages a JIT code generator to produce highly efficient kernels targeting x86 architectures.
+
+For details of LIBXSMM, please refer to:
+* [LIBXSMM User Manual](https://libxsmm.readthedocs.io/en/latest/)
+* [LIBXSMM github repo](https://github.com/hfp/libxsmm)
+
+# Motivation
+TVM has shown satisfactory performance on MLP models on CPU. However, there are still some defects in the assembly code generated by LLVM which keep AutoTVM/AutoScheduler from reaching optimal GEMM performance.
+
+LIBXSMM is an open-source library developed by Intel Labs for accelerating small matrix multiplication. It leverages a JIT code generator to produce highly efficient GEMM kernels for x86 CPUs, which can come very close to the hardware roofline. According to our evaluation, on “small” GEMM (cube_root(m * n * k) <= 256), LIBXSMM shows superior performance over the well-known BLAS library Intel MKL.
+
+Moreover, given that LIBXSMM can generate very efficient GEMM kernel implementations, it is also an ideal substitute for the inner kernel of normal-size GEMM. According to our experiments, the AutoTVM templates we wrote with LIBXSMM generating the register blocks achieve much higher performance than MKL and the existing TOPI implementation.
+
+# Guide-level explanation
+This proposal aims to integrate LIBXSMM into TVM to accelerate small GEMM and to serve as the inner kernel for accelerating normal-size GEMM.
+
+We will integrate LIBXSMM with TVM in the following 4 components:
+1. Add an extern call “tvm.contrib.libxsmm.gemm” in the “src/runtime/contrib” directory, and a corresponding Python interface in the "python/tvm/contrib/" directory, so users can call it just as they call CBLAS;
+2. Use BYOC to accelerate small GEMM (cube_root(m * n * k) <= 256) and its epilogue fusion variants (bias/relu/sigmoid/bias_relu/bias_sigmoid);
+3. Add the AutoTVM template we wrote with LIBXSMM as the inner kernel into TOPI, as a GEMM implementation candidate;
+4. Add target system and Relay op strategy support. When users specify `llvm -libs=libxsmm`, the Relay op strategy automatically lowers the corresponding GEMM ops to LIBXSMM.
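A minimal usage sketch of this proposed target integration (hypothetical until the op strategy support lands; it mirrors how `llvm -libs=cblas` is used today, and `mod`/`params` stand for an arbitrary Relay module and its parameters):

```python
import tvm
from tvm import relay

# With the proposed op strategy, GEMM-like ops (e.g. nn.dense) would be lowered to
# LIBXSMM whenever the library is listed in the target's -libs attribute.
target = tvm.target.Target("llvm -libs=libxsmm")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
```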
+
+# Reference-level explanation
+1. Users can call LIBXSMM just as they call CBLAS, through the extern call API.
+```python
+  def matmul(lhs, rhs, transa=False, transb=False, alpha=1.0, beta=0.0, lda=-1, ldb=-1, ldc=-1, **kwargs):
+    n = lhs.shape[1] if transa else lhs.shape[0]
+    m = rhs.shape[0] if transb else rhs.shape[1]
+    return te.extern(
+      (n, m),
+      [lhs, rhs],
+      lambda ins, outs: tvm.tir.call_packed(
+        "tvm.contrib.libxsmm.matmul", ins[0], ins[1], outs[0], transa, transb, alpha, beta, lda, ldb, ldc),
+      name="C",
+      **kwargs,
+  )
+```
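A minimal sketch of how this extern call could be used, assuming the proposed `tvm.contrib.libxsmm` module above is available (it is not part of TVM yet; the flow mirrors `tvm.contrib.cblas`):

```python
import tvm
from tvm import te
from tvm.contrib import libxsmm  # proposed module, as defined above

n, k, m = 64, 128, 32
A = te.placeholder((n, k), name="A", dtype="float32")
B = te.placeholder((k, m), name="B", dtype="float32")
C = libxsmm.matmul(A, B)  # extern op computing A @ B through LIBXSMM

s = te.create_schedule(C.op)
func = tvm.build(s, [A, B, C], target="llvm")
```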
+2. BYOC allows for graph partitioning and using LIBXSMM for code generation.
+  * API to obtain the partitioned function:
+```python
+  from tvm.relay.op.contrib import libxsmm
+
+  # API to call LIBXSMM partitioning
+  libxsmm_module = libxsmm.partition_for_libxsmm(module) 
+```
+  * Pattern matching table: 
+```python
+  @register_pattern_table("libxsmm")
+  def pattern_table():
+      dense_pattern = ("libxsmm.dense", make_pattern(with_bias=False, with_activation=None))
+      dense_bias_pattern = ("libxsmm.dense_bias", make_pattern(with_bias=True, with_activation=None))
+      dense_relu_pattern = ("libxsmm.dense_relu", make_pattern(with_bias=False, with_activation="relu"))
+      dense_sigmoid_pattern = ("libxsmm.dense_sigmoid", make_pattern(with_bias=False, with_activation="sigmoid"))
+      dense_bias_relu = ("libxsmm.dense_bias_relu", make_pattern(with_bias=True, with_activation="relu"))
+      dense_bias_sigmoid = ("libxsmm.dense_bias_sigmoid", make_pattern(with_bias=True, with_activation="sigmoid"))
+      libxsmm_pattern = [dense_pattern, dense_bias_pattern, dense_relu_pattern, dense_sigmoid_pattern, dense_bias_relu, dense_bias_sigmoid]
+      return libxsmm_pattern
+```
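The `make_pattern` helper is not spelled out in the RFC; a plausible sketch using the Relay dataflow pattern language (the exact pattern shape is an assumption) might look like:

```python
from tvm.relay.dataflow_pattern import is_op, wildcard

def make_pattern(with_bias=False, with_activation=None):
    # nn.dense, optionally followed by a bias add and an activation.
    data, weight, bias = wildcard(), wildcard(), wildcard()
    out = is_op("nn.dense")(data, weight)
    if with_bias:
        out = is_op("add")(out, bias)
    if with_activation == "relu":
        out = is_op("nn.relu")(out)
    elif with_activation == "sigmoid":
        out = is_op("sigmoid")(out)
    return out
```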
+  * Build with TVM
+```python
+  with tvm.transform.PassContext(opt_level=3):
+    lib = relay.build(libxsmm_module, target="llvm", params=params)
+```
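Since this BYOC path only targets "small" GEMM, the size criterion quoted earlier (cube_root(m * n * k) <= 256) reduces to a simple predicate; `is_small_gemm` below is a hypothetical helper, not part of the proposed API:

```python
def is_small_gemm(m, n, k, threshold=256):
    # "Small" GEMM criterion from this RFC: cube_root(m * n * k) <= 256.
    return (m * n * k) ** (1.0 / 3.0) <= threshold
```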
+3. Integrate into TOPI a GEMM AutoTVM template with LIBXSMM as the inner kernel.
+  * Use Tensorize/TensorIR to substitute the register block of GEMM with LIBXSMM
+```python
+  def intrin_func(ins, outs):
+    def _body():
+      ib = tvm.tir.ir_builder.create()
+      ib.emit(
+        tvm.tir.call_extern(
+          "int", "libxsmm_sgemm", m, n, k, 1.0, ins[0].access_ptr("r"), K, ins[1].access_ptr("r"), n, 0.0, outs[0].access_ptr("w"), N
+        )
+      )
+      return ib.get()
+
+    def _update():
+      ib = tvm.tir.ir_builder.create()
+      ib.emit(
+        tvm.tir.call_extern(
+           "int", "libxsmm_sgemm", m, n, k, 1.0, ins[0].access_ptr("r"), K, ins[1].access_ptr("r"), n, 1.0, outs[0].access_ptr("w"), N
+        )
+      )
+      return ib.get()
+```
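The snippet above only shows the intrinsic bodies; a sketch of how such an intrinsic could be declared and applied with tensorize (buffer strides, tile names, and the final `tensorize` call are assumptions, not the RFC's exact implementation):

```python
import tvm
from tvm import te

def libxsmm_gemm_intrin(m, n, k, K, N):
    # Declare the register-block computation that the intrinsic replaces.
    a = te.placeholder((m, k), name="a", dtype="float32")
    b = te.placeholder((k, n), name="b", dtype="float32")
    r = te.reduce_axis((0, k), name="r")
    c = te.compute((m, n), lambda i, j: te.sum(a[i, r] * b[r, j], axis=r), name="c")

    # The buffers carry the strides of the surrounding full-size matrices.
    Ab = tvm.tir.decl_buffer(a.shape, a.dtype, name="Ab", offset_factor=1, strides=[K, 1])
    Bb = tvm.tir.decl_buffer(b.shape, b.dtype, name="Bb", offset_factor=1, strides=[N, 1])
    Cb = tvm.tir.decl_buffer(c.shape, c.dtype, name="Cb", offset_factor=1, strides=[N, 1])

    def intrin_func(ins, outs):
        # Full-computation form of the RFC's _body (beta = 0.0 overwrites the output
        # block); the _update variant (beta = 1.0) would be used when the reduction
        # axis is split outside the intrinsic.
        ib = tvm.tir.ir_builder.create()
        ib.emit(
            tvm.tir.call_extern(
                "int", "libxsmm_sgemm", m, n, k, 1.0,
                ins[0].access_ptr("r"), K, ins[1].access_ptr("r"), n,
                0.0, outs[0].access_ptr("w"), N,
            )
        )
        return ib.get()

    return te.decl_tensor_intrin(c.op, intrin_func, binds={a: Ab, b: Bb, c: Cb})

# Inside the AutoTVM template, the intrinsic would then be applied with something like:
#   s[C].tensorize(inner_m, libxsmm_gemm_intrin(m_tile, n_tile, k_tile, K, N))
```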
+
+# Testing
+We will add unit tests for the corresponding extern call, BYOC, and TOPI related code:
+* Make sure the result LIBXSMM produces matches its TVM counterpart;
+* Confirm the match patterns work as expected.
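A sketch of the numerical check such a unit test could perform, assuming the proposed `tvm.contrib.libxsmm.matmul` extern call from above (module and names are placeholders until the code is upstreamed):

```python
import numpy as np
import tvm
from tvm import te
from tvm.contrib import libxsmm  # proposed module

def test_libxsmm_matmul():
    n, k, m = 32, 48, 16
    A = te.placeholder((n, k), name="A", dtype="float32")
    B = te.placeholder((k, m), name="B", dtype="float32")
    C = libxsmm.matmul(A, B)
    func = tvm.build(te.create_schedule(C.op), [A, B, C], target="llvm")

    dev = tvm.cpu(0)
    a = tvm.nd.array(np.random.uniform(size=(n, k)).astype("float32"), dev)
    b = tvm.nd.array(np.random.uniform(size=(k, m)).astype("float32"), dev)
    c = tvm.nd.array(np.zeros((n, m), dtype="float32"), dev)
    func(a, b, c)
    # Compare against a NumPy reference; a TVM counterpart (e.g. a TOPI dense) works too.
    np.testing.assert_allclose(c.numpy(), a.numpy() @ b.numpy(), rtol=1e-5)
```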
+
+# Drawbacks
+* Though LIBXSMM works well with AutoTVM, it does not help AutoScheduler;
+* Memory footprint will increase as JIT code is generated; an LRU kernel cache might be required to mitigate this.
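For illustration, a minimal sketch of such a bounded kernel cache (the real cache would live in the C++ runtime and key on the full GEMM descriptor; names here are placeholders):

```python
from collections import OrderedDict

class KernelCache:
    """Keep at most `capacity` JIT-generated kernels, evicting the least recently used."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._kernels = OrderedDict()

    def get_or_create(self, key, create_fn):
        # key could be (m, n, k, lda, ldb, ldc, dtype, fused-epilogue flags).
        if key in self._kernels:
            self._kernels.move_to_end(key)  # mark as most recently used
            return self._kernels[key]
        kernel = create_fn()
        self._kernels[key] = kernel
        if len(self._kernels) > self.capacity:
            self._kernels.popitem(last=False)  # evict the least recently used kernel
        return kernel
```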
+
+# Future possibilities
+* LIBXSMM has DNN support, so it might be interesting to also integrate DNN primitives such as Conv into TVM;
+* LIBXSMM has quantized (int8) kernels; we could also integrate them into TVM, as long as they surpass the existing oneDNN implementation.
+
+# Upstream plan
+This proposal will be split into the following PR series:
+1. Add LIBXSMM as an extern call;
+2. Add LIBXSMM to BYOC for accelerating small GEMM;
+3. Add LIBXSMM-enabled normal-size GEMM to TOPI;
+4. Add LIBXSMM-enabled normal-size GEMM to Relay op strategy.

Review comment:
       It does make sense to have the order you suggested. I will update the plan ASAP. 







[GitHub] [tvm-rfcs] comaniac commented on a change in pull request #47: [RFC] Integrate LIBXSMM with TVM.

Posted by GitBox <gi...@apache.org>.
comaniac commented on a change in pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47#discussion_r773420170



##########
File path: rfcs/0046-Intel-LIBXSMM-integration.md
##########
@@ -0,0 +1,108 @@
+# Summary
+This RFC introduces the plan to integrate LIBXSMM into TVM. LIBXSMM leverages a JIT code generator to produce highly efficient kernels targeting x86 architectures.
+
+For details of LIBXSMM, please refer to:
+* [LIBXSMM User Manual](https://libxsmm.readthedocs.io/en/latest/)
+* [LIBXSMM github repo](https://github.com/hfp/libxsmm)
+
+# Motivation
+TVM has shown satisfactory performance on MLP models on CPU. However, there are still some defects in the assembly code generated by LLVM which keep AutoTVM/AutoScheduler from reaching optimal GEMM performance.
+
+LIBXSMM is an open-source library developed by Intel Labs for accelerating small matrix multiplication. It leverages a JIT code generator to produce highly efficient GEMM kernels for x86 CPUs, which can come very close to the hardware roofline. According to our evaluation, on “small” GEMM (cube_root(m * n * k) <= 256), LIBXSMM shows superior performance over the well-known BLAS library Intel MKL.
+
+Moreover, given that LIBXSMM can generate very efficient GEMM kernel implementations, it is also an ideal substitute for the inner kernel of normal-size GEMM. According to our experiments, the AutoTVM templates we wrote with LIBXSMM generating the register blocks achieve much higher performance than MKL and the existing TOPI implementation.
+
+# Guide-level explanation
+This proposal aims to integrate LIBXSMM into TVM to accelerate small GEMM and to serve as the inner kernel for accelerating normal-size GEMM.
+
+We will integrate LIBXSMM with TVM in the following 4 components:
+1. Add an extern call “tvm.contrib.libxsmm.gemm” in the “src/runtime/contrib” directory, and a corresponding Python interface in the "python/tvm/contrib/" directory, so users can call it just as they call CBLAS;
+2. Use BYOC to accelerate small GEMM (cube_root(m * n * k) <= 256) and its epilogue fusion variants (bias/relu/sigmoid/bias_relu/bias_sigmoid);
+3. Add the AutoTVM template we wrote with LIBXSMM as the inner kernel into TOPI, as a GEMM implementation candidate;
+4. Add target system and Relay op strategy support. When users specify `llvm -libs=libxsmm`, the Relay op strategy automatically lowers the corresponding GEMM ops to LIBXSMM.
+
+# Reference-level explanation
+1. Users can call LIBXSMM just as they call CBLAS, through the extern call API.
+```python
+  def matmul(lhs, rhs, transa=False, transb=False, alpha=1.0, beta=0.0, lda=-1, ldb=-1, ldc=-1, **kwargs):
+    n = lhs.shape[1] if transa else lhs.shape[0]
+    m = rhs.shape[0] if transb else rhs.shape[1]
+    return te.extern(
+      (n, m),
+      [lhs, rhs],
+      lambda ins, outs: tvm.tir.call_packed(
+        "tvm.contrib.libxsmm.matmul", ins[0], ins[1], outs[0], transa, transb, alpha, beta, lda, ldb, ldc),
+      name="C",
+      **kwargs,
+  )
+```
+2. BYOC allows for graph partitioning and using LIBXSMM for code generation.
+  * API to obtain the partitioned function:
+```python
+  from tvm.relay.op.contrib import libxsmm
+
+  # API to call LIBXSMM partitioning
+  libxsmm_module = libxsmm.partition_for_libxsmm(module) 
+```
+  * Pattern matching table: 
+```python
+  @register_pattern_table("libxsmm")
+  def pattern_table():
+      dense_pattern = ("libxsmm.dense", make_pattern(with_bias=False, with_activation=None))
+      dense_bias_pattern = ("libxsmm.dense_bias", make_pattern(with_bias=True, with_activation=None))
+      dense_relu_pattern = ("libxsmm.dense_relu", make_pattern(with_bias=False, with_activation="relu"))
+      dense_sigmoid_pattern = ("libxsmm.dense_sigmoid", make_pattern(with_bias=False, with_activation="sigmoid"))
+      dense_bias_relu = ("libxsmm.dense_bias_relu", make_pattern(with_bias=True, with_activation="relu"))
+      dense_bias_sigmoid = ("libxsmm.dense_bias_sigmoid", make_pattern(with_bias=True, with_activation="sigmoid"))
+      libxsmm_pattern = [dense_pattern, dense_bias_pattern, dense_relu_pattern, dense_sigmoid_pattern, dense_bias_relu, dense_bias_sigmoid]
+      return libxsmm_pattern
+```
+  * Build with TVM
+```python
+  with tvm.transform.PassContext(opt_level=3):
+    lib = relay.build(libxsmm_module, target="llvm", params=params)
+```
+3. Integrate into TOPI a GEMM AutoTVM template with LIBXSMM as the inner kernel.
+  * Use Tensorize/TensorIR to substitute the register block of GEMM with LIBXSMM
+```python
+  def intrin_func(ins, outs):
+    def _body():
+      ib = tvm.tir.ir_builder.create()
+      ib.emit(
+        tvm.tir.call_extern(
+          "int", "libxsmm_sgemm", m, n, k, 1.0, ins[0].access_ptr("r"), K, ins[1].access_ptr("r"), n, 0.0, outs[0].access_ptr("w"), N
+        )
+      )
+      return ib.get()
+
+    def _update():
+      ib = tvm.tir.ir_builder.create()
+      ib.emit(
+        tvm.tir.call_extern(
+           "int", "libxsmm_sgemm", m, n, k, 1.0, ins[0].access_ptr("r"), K, ins[1].access_ptr("r"), n, 1.0, outs[0].access_ptr("w"), N
+        )
+      )
+      return ib.get()
+```
+
+# Testing
+We will add unit tests for the corresponding extern call, BYOC, and TOPI related code:
+* Make sure the result LIBXSMM produces matches its TVM counterpart;
+* Confirm the match patterns work as expected.
+
+# Drawbacks
+* Though LIBXSMM works well with AutoTVM, it does not help AutoScheduler;
+* Memory footprint will increase as JIT code is generated; an LRU kernel cache might be required to mitigate this.
+
+# Future possibilities
+* LIBXSMM has DNN support, so it might be interesting to also integrate DNN primitives such as Conv into TVM;
+* LIBXSMM has quantized (int8) kernels; we could also integrate them into TVM, as long as they surpass the existing oneDNN implementation.
+
+# Upstream plan
+This proposal will be split into the following PR series:
+1. Add LIBXSMM as an extern call;
+2. Add LIBXSMM to BYOC for accelerating small GEMM;
+3. Add LIBXSMM-enabled normal-size GEMM to TOPI;
+4. Add LIBXSMM-enabled normal-size GEMM to Relay op strategy.

Review comment:
       I reconsidered this plan and feel that steps 1, 3, and 4 form a group, while step 2 is relatively independent. After all, you don't need extern ops to make BYOC work. Thus, I would suggest either (1, 3, 4, 2) or (2, 1, 3, 4).
   
   In addition, what I meant by "TVM CI" was that you will need to add LIBXSMM to the docker environment before merging any changes; otherwise your unit tests won't pass the CI. In summary, I would suggest the following PRs:
   
   1. Add LIBXSMM to TVM CI.
   2. BYOC support.
   3. Documentation about LIBXSMM support, including supported ops/patterns/dtypes/versions and limitations. Ref: https://tvm.apache.org/docs/how_to/deploy/tensorrt.html.
   4. Relay/TOPI op support (this can be at step 2 if you prefer).
   







[GitHub] [tvm-rfcs] zhuwenxi commented on pull request #47: [RFC] Integrate LIBXSMM with TVM.

Posted by GitBox <gi...@apache.org>.
zhuwenxi commented on pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47#issuecomment-998595632


   @comaniac Any update?





[GitHub] [tvm-rfcs] comaniac commented on pull request #47: [RFC] Integrate LIBXSMM with TVM.

Posted by GitBox <gi...@apache.org>.
comaniac commented on pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47#issuecomment-1012508045


   > @comaniac I've recently started implementing the first PR, "Add libxsmm to TVM CI". I wonder if there is any CI-related PR I can refer to?
   
   You could refer to a PR like https://github.com/apache/tvm/pull/9881 or something similar.





[GitHub] [tvm-rfcs] comaniac merged pull request #47: [RFC] Integrate LIBXSMM with TVM.

Posted by GitBox <gi...@apache.org>.
comaniac merged pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47


   





[GitHub] [tvm-rfcs] comaniac commented on pull request #47: [RFC] Integrate LIBXSMM with TVM.

Posted by GitBox <gi...@apache.org>.
comaniac commented on pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47#issuecomment-1000933605


   Thanks @zhuwenxi, this is now merged.
   Also please file a follow-up PR to add the RFC information (start date, RFC PR and RFC tracking issue). You can refer to other merged RFCs for details.





[GitHub] [tvm-rfcs] zhuwenxi commented on pull request #47: [RFC] Integrate LIBXSMM with TVM.

Posted by GitBox <gi...@apache.org>.
zhuwenxi commented on pull request #47:
URL: https://github.com/apache/tvm-rfcs/pull/47#issuecomment-1000249547


   @comaniac Thank you, I've updated the plan. Please let me know if there are still any problems.

